Preface


This book is aimed at senior undergraduates and graduate students in
Engineering, Science, Mathematics, and Computing. It expects familiarity with
calculus, probability theory, and linear algebra as taught in a first- or
second-year undergraduate course on mathematics for scientists and engineers.

Conventional courses on information theory cover not only the beautiful
theoretical ideas of Shannon, but also practical solutions to communication
problems. This book goes further, bringing in Bayesian data modelling,
Monte Carlo methods, variational methods, clustering algorithms, and neural
networks.

Why unify information theory and machine learning? Because they are 
two sides of the same coin. In the 1960s, a single field, cybernetics, was 
populated by information theorists, computer scientists, and neuroscientists, 
all studying common problems. Information theory and machine learning still 
belong together. Brains are the ultimate compression and communication 
systems. And the state-of-the-art algorithms for both data compression and 
error-correcting codes use the same tools as machine learning. 


How to use this book 


The essential dependencies between chapters are indicated in the figure on the 
next page. An arrow from one chapter to another indicates that the second 
chapter requires some of the first. 

Within Parts I, II, IV, and V of this book, chapters on advanced or optional 
topics are towards the end. All chapters of Part III are optional on a first 
reading, except perhaps for Chapter 16 (Message Passing). 

The same system sometimes applies within a chapter: the final sections often
deal with advanced topics that can be skipped on a first reading. For example,
in two key chapters, Chapter 4 (The Source Coding Theorem) and Chapter 10
(The Noisy-Channel Coding Theorem), the first-time reader should
detour at section 4.5 and section 10.4 respectively.

Pages vii–x show a few ways to use this book. First, I give the roadmap for
a course that I teach in Cambridge: ‘Information theory, pattern recognition, 
and neural networks’. The book is also intended as a textbook for traditional 
courses in information theory. The second roadmap shows the chapters for an 
introductory information theory course and the third for a course aimed at an 
understanding of state-of-the-art error-correcting codes. The fourth roadmap 
shows how to use the text in a conventional course on machine learning. 


About the exercises 


You can understand a subject only by creating it for yourself. The exercises 
play an essential role in this book. For guidance, each has a rating (similar to 
that used by Knuth (1968)) from 1 to 5 to indicate its difficulty. 

In addition, exercises that are especially recommended are marked by a
marginal encouraging rat. Some exercises that require the use of a computer
are marked with a C.

Answers to many exercises are provided. Use them wisely. Where a solution
is provided, this is indicated by including its page number alongside the
difficulty rating.

Solutions to many of the other exercises will be supplied to instructors
using this book in their teaching; please email solutions@cambridge.org.


Summary of codes for exercises 


Marginal rat   Especially recommended
▷              Recommended
C              Parts require a computer
[p.42]         Solution provided on page 42

[1]   Simple (one minute)
[2]   Medium (quarter hour)
[3]   Moderately hard
[4]   Hard
[5]   Research project








Internet resources 


The website 
http://www.inference.phy.cam.ac.uk/mackay/itila 


contains several resources: 


1. Software. Teaching software that I use in lectures, interactive software, 
and research software, written in perl, octave, tcl, C, and gnuplot. 
Also some animations. 


2. Corrections to the book. Thank you in advance for emailing these! 


3. This book. The book is provided in postscript, pdf, and djvu formats 
for on-screen viewing. The same copyright restrictions apply as to a 
normal book. 


About this edition 


This is the fourth printing of the first edition. In the second printing, the 
design of the book was altered slightly. Page-numbering generally remained 
unchanged, except in chapters 1, 6, and 28, where a few paragraphs, figures, 
and equations moved around. All equation, section, and exercise numbers 
were unchanged. In the third printing, chapter 8 was renamed ‘Dependent 
Random Variables’, instead of ‘Correlated’, which was sloppy. 
About Chapter 1


In the first chapter, you will need to be familiar with the binomial distribution.
And to solve the exercises in the text (which I urge you to do) you will need
to know Stirling’s approximation for the factorial function, $x! \simeq x^x e^{-x}$, and
be able to apply it to $\binom{N}{r} = \frac{N!}{(N-r)!\,r!}$. These topics are reviewed below.

[Margin note: Unfamiliar notation? See Appendix A, p.598.]

The binomial distribution 


Example 1.1. A bent coin has probability f of coming up heads. The coin is 
tossed N times. What is the probability distribution of the number of 
heads, r? What are the mean and variance of r? 





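Solution. The number of heads has a binomial distribution.

\[ P(r \mid f, N) = \binom{N}{r} f^r (1-f)^{N-r} . \tag{1.1} \]

[Figure 1.1. The binomial distribution $P(r \mid f = 0.3, N = 10)$.]

The mean, $\mathcal{E}[r]$, and variance, $\mathrm{var}[r]$, of this distribution are defined by

\[ \mathcal{E}[r] \equiv \sum_{r=0}^{N} P(r \mid f, N)\, r \tag{1.2} \]

\[ \mathrm{var}[r] \equiv \mathcal{E}\!\left[ (r - \mathcal{E}[r])^2 \right] \tag{1.3} \]

\[ \phantom{\mathrm{var}[r]} = \mathcal{E}[r^2] - (\mathcal{E}[r])^2 = \sum_{r=0}^{N} P(r \mid f, N)\, r^2 - (\mathcal{E}[r])^2 . \tag{1.4} \]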

Rather than evaluating the sums over r in (1.2) and (1.4) directly, it is easiest 
to obtain the mean and variance by noting that r is the sum of N independent 
random variables, namely, the number of heads in the first toss (which is either 
zero or one), the number of heads in the second toss, and so forth. In general, 


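\[ \mathcal{E}[x + y] = \mathcal{E}[x] + \mathcal{E}[y] \quad \text{for any random variables $x$ and $y$;} \]
\[ \mathrm{var}[x + y] = \mathrm{var}[x] + \mathrm{var}[y] \quad \text{if $x$ and $y$ are independent.} \tag{1.5} \]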





So the mean of r is the sum of the means of those random variables, and the
variance of r is the sum of their variances. The mean number of heads in a
single toss is $f \times 1 + (1-f) \times 0 = f$, and the variance of the number of heads
in a single toss is

\[ \left[ f \times 1^2 + (1-f) \times 0^2 \right] - f^2 = f - f^2 = f(1-f), \tag{1.6} \]


so the mean and variance of r are: 





\[ \mathcal{E}[r] = Nf \quad \text{and} \quad \mathrm{var}[r] = Nf(1-f). \tag{1.7} \]
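As a quick numerical check of the shortcut, the sums over r in (1.2) and (1.4) can also be evaluated by brute force and compared with Nf and Nf(1−f). The sketch below is only an illustration; the function names are my own and the values f = 0.3, N = 10 are borrowed from figure 1.1.

```python
from math import comb


def binomial_pmf(r, f, N):
    """P(r | f, N) for the bent coin, equation (1.1)."""
    return comb(N, r) * f**r * (1 - f) ** (N - r)


def mean_and_variance(f, N):
    """Evaluate the sums (1.2) and (1.4) directly, term by term."""
    probs = [binomial_pmf(r, f, N) for r in range(N + 1)]
    mean = sum(p * r for r, p in enumerate(probs))
    second_moment = sum(p * r**2 for r, p in enumerate(probs))
    return mean, second_moment - mean**2


mean, var = mean_and_variance(f=0.3, N=10)
# Compare with the closed forms N f = 3 and N f (1 - f) = 2.1.
print(mean, var)
```

Up to floating-point rounding, the two routes agree.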













Approximating $x!$ and $\binom{N}{r}$


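Let’s derive Stirling’s approximation by an unconventional route. We start
from the Poisson distribution with mean $\lambda$,

\[ P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!} \qquad r \in \{0, 1, 2, \ldots\}. \tag{1.8} \]

For large $\lambda$, this distribution is well approximated, at least in the vicinity of
$r \simeq \lambda$, by a Gaussian distribution with mean $\lambda$ and variance $\lambda$:

\[ e^{-\lambda} \frac{\lambda^r}{r!} \simeq \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{(r-\lambda)^2}{2\lambda}} . \tag{1.9} \]

Let’s plug $r = \lambda$ into this formula, then rearrange it.

\[ e^{-\lambda} \frac{\lambda^\lambda}{\lambda!} \simeq \frac{1}{\sqrt{2\pi\lambda}} \tag{1.10} \]

\[ \Rightarrow \quad \lambda! \simeq \lambda^\lambda e^{-\lambda} \sqrt{2\pi\lambda} . \tag{1.11} \]

This is Stirling’s approximation for the factorial function.

\[ x! \simeq x^x e^{-x} \sqrt{2\pi x} \quad \Leftrightarrow \quad \ln x! \simeq x \ln x - x + \tfrac{1}{2} \ln 2\pi x . \tag{1.12} \]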
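To get a feel for how good (1.12) is, one can compare it with the exact factorial for a few values of x. The following is a minimal sketch; the function names and the sample values of x are arbitrary choices of mine.

```python
from math import exp, factorial, pi, sqrt


def stirling_leading(x):
    """Leading-order behaviour x^x e^{-x} only."""
    return x**x * exp(-x)


def stirling(x):
    """Stirling's approximation (1.12), including the sqrt(2 pi x) factor."""
    return x**x * exp(-x) * sqrt(2 * pi * x)


for x in (1, 5, 10, 50):
    exact = factorial(x)
    # Ratios of approximation to exact value; a ratio of 1 would be perfect.
    print(x, stirling_leading(x) / exact, stirling(x) / exact)
```

The ratio for the full formula approaches 1 as x grows, whereas the leading-order form alone is off by the factor $\sqrt{2\pi x}$.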


We have derived not only the leading order behaviour, $x! \simeq x^x e^{-x}$, but also,
at no cost, the next-order correction term $\sqrt{2\pi x}$. We now apply Stirling’s
approximation to $\ln \binom{N}{r}$:

\[ \ln \binom{N}{r} \equiv \ln \frac{N!}{(N-r)!\, r!} \simeq (N-r) \ln \frac{N}{N-r} + r \ln \frac{N}{r} . \tag{1.13} \]

Since all the terms in this equation are logarithms, this result can be rewritten
in any base. We will denote natural logarithms ($\log_e$) by ‘ln’, and logarithms
to base 2 ($\log_2$) by ‘log’.

If we introduce the binary entropy function,

\[ H_2(x) \equiv x \log \frac{1}{x} + (1-x) \log \frac{1}{(1-x)} , \tag{1.14} \]

then we can rewrite the approximation (1.13) as

\[ \log \binom{N}{r} \simeq N H_2(r/N), \tag{1.15} \]

or, equivalently,

\[ \binom{N}{r} \simeq 2^{N H_2(r/N)} . \tag{1.16} \]

If we need a more accurate approximation, we can include terms of the next
order from Stirling’s approximation (1.12):

\[ \log \binom{N}{r} \simeq N H_2(r/N) - \tfrac{1}{2} \log \left[ 2\pi N \, \frac{N-r}{N} \, \frac{r}{N} \right] . \tag{1.17} \]
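The two approximations (1.15) and (1.17) can be compared with the exact value of $\log \binom{N}{r}$ in a few lines. This is a sketch under my own choice of test values (N = 100, r = 30); the helper names are not from the text.

```python
from math import comb, log2, pi


def H2(x):
    """Binary entropy function (1.14), in bits; defined as 0 at x = 0 or 1."""
    if x in (0.0, 1.0):
        return 0.0
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))


def log2_binom_leading(N, r):
    """Leading-order approximation (1.15)."""
    return N * H2(r / N)


def log2_binom_corrected(N, r):
    """Approximation (1.17), with the next-order term from (1.12)."""
    return N * H2(r / N) - 0.5 * log2(2 * pi * N * ((N - r) / N) * (r / N))


N, r = 100, 30
exact = log2(comb(N, r))
print(exact, log2_binom_leading(N, r), log2_binom_corrected(N, r))
```

For these values the leading-order estimate is a few bits too large, while the corrected estimate is very close to the exact value.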

Figure 1.2. The Poisson distribution $P(r \mid \lambda = 15)$.

Figure 1.3. The binary entropy function.

Introduction to Information Theory


The fundamental problem of communication is that of reproducing 
at one point either exactly or approximately a message selected at 
another point. 


(Claude Shannon, 1948) 


In the first half of this book we study how to measure information content; we 
learn how to compress data; and we learn how to communicate perfectly over 
imperfect communication channels. 

We start by getting a feeling for this last problem. 


> 1.1 How can we achieve perfect communication over an imperfect, 
noisy communication channel? 


Some examples of noisy communication channels are:

• an analogue telephone line, over which two modems communicate digital
  information (modem → phone line → modem);

• the radio communication link from Galileo, the Jupiter-orbiting spacecraft,
  to earth (Galileo → radio waves → Earth);

• reproducing cells, in which the daughter cells’ DNA contains information
  from the parent cells (parent cell → daughter cells);

• a disk drive (computer memory → disk drive → computer memory).

The last example shows that communication doesn’t have to involve information
going from one place to another. When we write a file on a disk drive,
we’ll read it off in the same location, but at a later time.

These channels are noisy. A telephone line suffers from cross-talk with 
other lines; the hardware in the line distorts and adds noise to the transmitted 
signal. The deep space network that listens to Galileo’s puny transmitter 
receives background radiation from terrestrial and cosmic sources. DNA is 
subject to mutations and damage. A disk drive, which writes a binary digit 
(a one or zero, also known as a bit) by aligning a patch of magnetic material 
in one of two orientations, may later fail to read out the stored binary digit: 
the patch of material might spontaneously flip magnetization, or a glitch of 
background noise might cause the reading circuit to report the wrong value 
for the binary digit, or the writing head might not induce the magnetization 
in the first place because of interference from neighbouring bits. 

In all these cases, if we transmit data, e.g., a string of bits, over the channel,
there is some probability that the received message will not be identical to the
transmitted message. We would prefer to have a communication channel for
which this probability was zero, or so close to zero that for practical purposes
it is indistinguishable from zero.

Let’s consider a noisy disk drive that transmits each bit correctly with
probability (1 − f) and incorrectly with probability f. This model communication
channel is known as the binary symmetric channel (figure 1.4).


Figure 1.4. The binary symmetric channel. The transmitted symbol is x and
the received symbol y. The noise level, the probability that a bit is flipped, is f.

\[ P(y=0 \mid x=0) = 1-f; \qquad P(y=0 \mid x=1) = f; \]
\[ P(y=1 \mid x=0) = f; \qquad P(y=1 \mid x=1) = 1-f. \]


Figure 1.5. A binary data sequence of length 10 000 transmitted over a binary
symmetric channel with noise level f = 0.1. [Dilbert image Copyright © 1997
United Feature Syndicate, Inc., used with permission.]
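The binary symmetric channel is easy to simulate. The sketch below flips each bit independently with probability f, which is all the model of figure 1.4 says; the random source bits, the seed, and the function name are assumptions made for the illustration.

```python
import random


def binary_symmetric_channel(bits, f, rng):
    """Return a copy of `bits` in which each bit is flipped with probability f."""
    return [b ^ (rng.random() < f) for b in bits]


rng = random.Random(0)  # fixed seed, chosen arbitrarily so runs are repeatable
x = [rng.randint(0, 1) for _ in range(10000)]   # transmitted sequence
y = binary_symmetric_channel(x, f=0.1, rng=rng)  # received sequence

flipped = sum(a != b for a, b in zip(x, y))
print(flipped / len(x))  # fraction of flipped bits, expected to be near f
```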





As an example, let’s imagine that f = 0.1, that is, ten per cent of the bits are
flipped (figure 1.5). A useful disk drive would flip no bits at all in its entire
lifetime. If we expect to read and write a gigabyte per day for ten years, we
require a bit error probability of the order of $10^{-15}$, or smaller. There are two
approaches to this goal.
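The order of magnitude quoted here can be checked with a line or two of arithmetic. The sketch assumes a gigabyte of $10^9$ bytes and ignores leap days; both are my simplifications.

```python
bits_per_day = 8 * 10**9           # one gigabyte per day, taken as 10^9 bytes
days = 10 * 365                    # ten years, ignoring leap days
total_bits = bits_per_day * days   # bits handled over the drive's lifetime

# With bit error probability p, the expected number of errors is p * total_bits.
# Keeping that expectation below one requires p < 1 / total_bits.
max_error_probability = 1 / total_bits
print(total_bits, max_error_probability)
```

The bound comes out at a few times $10^{-14}$; asking for the expected number of errors to be well below one, and counting both reads and writes, brings the requirement to the order of $10^{-15}$.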


The physical solution 


The physical solution is to improve the physical characteristics of the commu- 
nication channel to reduce its error probability. We could improve our disk 
drive by 


1. using more reliable components in its circuitry; 


2. evacuating the air from the disk enclosure so as to eliminate the turbu- 
lence that perturbs the reading head from the track; 


3. using a larger magnetic patch to represent each bit; or 


4. using higher-power signals or cooling the circuitry in order to reduce 
thermal noise. 


These physical modifications typically increase the cost of the communication 
channel. 


The ‘system’ solution 


Information theory and coding theory offer an alternative (and much more ex- 
citing) approach: we accept the given noisy channel as it is and add communi- 
cation systems to it so that we can detect and correct the errors introduced by 
the channel. As shown in figure 1.6, we add an encoder before the channel and 
a decoder after it. The encoder encodes the source message s into a transmit- 
ted message t, adding redundancy to the original message in some way. The 
channel adds noise to the transmitted message, yielding a received message r. 
The decoder uses the known redundancy introduced by the encoding system 
to infer both the original signal s and the added noise. 

Figure 1.6. The ‘system’ solution for achieving reliable communication over a
noisy channel. The encoding system introduces systematic redundancy into the
transmitted vector t. The decoding system uses this known redundancy to deduce
from the received vector r both the original source vector and the noise
introduced by the channel.
[Diagram: Source → s → Encoder → t → Noisy channel → r → Decoder → ŝ.]


Whereas physical solutions give incremental channel improvements only at 
an ever-increasing cost, system solutions can turn noisy channels into reliable 
communication channels with the only cost being a computational requirement 
at the encoder and decoder. 

Information theory is concerned with the theoretical limitations and po- 
tentials of such systems. ‘What is the best error-correcting performance we 
could achieve?’ 

Coding theory is concerned with the creation of practical encoding and 
decoding systems. 
















> 1.2 Error-correcting codes for the binary symmetric channel 


We now consider examples of encoding and decoding systems. What is the 
simplest way to add useful redundancy to a transmission? [To make the rules 
of the game clear: we want to be able to detect and correct errors; and re- 
transmission is not an option. We get only one chance to encode, transmit, 
and decode.] 


Repetition codes 


A straightforward idea is to repeat every bit of the message a prearranged
number of times, for example, three times, as shown in table 1.7. We call
this repetition code ‘R3’.

Table 1.7. The repetition code R3.

Source sequence s    Transmitted sequence t
0                    000
1                    111

Imagine that we transmit the source message

s = 0 0 1 0 1 1 0



over a binary symmetric channel with noise level f = 0.1 using this repetition 
code. We can describe the channel as ‘adding’ a sparse noise vector n to the 
transmitted vector — adding in modulo 2 arithmetic, i.e., the binary algebra 
in which 1+1=0. A possible noise vector n and received vector r = t + n are 
shown in figure 1.8. 


Figure 1.8. An example transmission using R3.

s    0    0    1    0    1    1    0
t   000  000  111  000  111  111  000
n   000  001  000  000  101  000  000
r   000  001  111  000  010  111  000
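The transmission of figure 1.8 can be reproduced directly. The following sketch encodes s with R3 and adds the noise vector of the figure in modulo-2 arithmetic; the function names are illustrative only.

```python
def encode_repetition(source, repeats=3):
    """Repeat every source bit `repeats` times (R3 when repeats = 3)."""
    return [bit for bit in source for _ in range(repeats)]


def add_mod2(t, noise):
    """Bitwise addition modulo 2, i.e. XOR, so that 1 + 1 = 0."""
    return [a ^ b for a, b in zip(t, noise)]


s = [0, 0, 1, 0, 1, 1, 0]
t = encode_repetition(s)
n = [int(c) for c in "000001000000101000000"]  # noise vector from figure 1.8
r = add_mod2(t, n)

print("".join(map(str, t)))
print("".join(map(str, r)))
```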


How should we decode this received vector? The optimal algorithm looks 
at the received bits three at a time and takes a majority vote (algorithm 1.9). 

Algorithm 1.9. Majority-vote decoding algorithm for R3. Also shown are the
likelihood ratios (1.23), assuming the channel is a binary symmetric channel;
$\gamma \equiv (1-f)/f$.

Received sequence r    Likelihood ratio P(r | s=1)/P(r | s=0)    Decoded sequence ŝ
000                    $\gamma^{-3}$                              0
001                    $\gamma^{-1}$                              0
010                    $\gamma^{-1}$                              0
100                    $\gamma^{-1}$                              0
101                    $\gamma^{1}$                               1
110                    $\gamma^{1}$                               1
011                    $\gamma^{1}$                               1
111                    $\gamma^{3}$                               1





At the risk of explaining the obvious, let’s prove this result. The optimal
decoding decision (optimal in the sense of having the smallest probability of
being wrong) is to find which value of s is most probable, given r. Consider
the decoding of a single bit s, which was encoded as t(s) and gave rise to three
received bits $\mathbf{r} = r_1 r_2 r_3$. By Bayes’ theorem, the posterior probability of s is

\[ P(s \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s)\, P(s)}{P(r_1 r_2 r_3)} . \tag{1.18} \]

We can spell out the posterior probability of the two alternatives thus:

\[ P(s=1 \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s=1)\, P(s=1)}{P(r_1 r_2 r_3)} ; \tag{1.19} \]

\[ P(s=0 \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s=0)\, P(s=0)}{P(r_1 r_2 r_3)} . \tag{1.20} \]

This posterior probability is determined by two factors: the prior probability
P(s), and the data-dependent term $P(r_1 r_2 r_3 \mid s)$, which is called the likelihood
of s. The normalizing constant $P(r_1 r_2 r_3)$ needn’t be computed when finding the
optimal decoding decision, which is to guess $\hat{s} = 0$ if $P(s=0 \mid \mathbf{r}) > P(s=1 \mid \mathbf{r})$,
and $\hat{s} = 1$ otherwise.





To find P(s=0|r) and P(s=1|r), we must make an assumption about the 
prior probabilities of the two hypotheses s =0 and s = 1, and we must make an 
assumption about the probability of r given s. We assume that the prior prob- 
abilities are equal: P(s=0) = P(s=1) = 0.5; then maximizing the posterior 
probability P(s |r) is equivalent to maximizing the likelihood P(r | s). And we 
assume that the channel is a binary symmetric channel with noise level f < 0.5, 
so that the likelihood is 








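\[ P(\mathbf{r} \mid s) = P(\mathbf{r} \mid \mathbf{t}(s)) = \prod_{n=1}^{N} P(r_n \mid t_n(s)), \tag{1.21} \]

where N = 3 is the number of transmitted bits in the block we are considering,
and

\[ P(r_n \mid t_n) = \begin{cases} (1-f) & \text{if } r_n = t_n \\ f & \text{if } r_n \neq t_n . \end{cases} \tag{1.22} \]

Thus the likelihood ratio for the two hypotheses is

\[ \frac{P(\mathbf{r} \mid s=1)}{P(\mathbf{r} \mid s=0)} = \prod_{n=1}^{N} \frac{P(r_n \mid t_n(1))}{P(r_n \mid t_n(0))} ; \tag{1.23} \]

each factor $\frac{P(r_n \mid t_n(1))}{P(r_n \mid t_n(0))}$ equals $\frac{(1-f)}{f}$ if $r_n = 1$ and $\frac{f}{(1-f)}$ if $r_n = 0$. The ratio
$\gamma \equiv \frac{(1-f)}{f}$ is greater than 1, since f < 0.5, so the winning hypothesis is the
one with the most ‘votes’, each vote counting for a factor of $\gamma$ in the likelihood
ratio.
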



Thus the majority-vote decoder shown in algorithm 1.9 is the optimal decoder 
if we assume that the channel is a binary symmetric channel and that the two 
possible source messages 0 and 1 have equal prior probability. 
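The argument can be checked exhaustively, since there are only eight possible received triplets. In this sketch (my own function names; f = 0.1 is an arbitrary noise level below 0.5) the decision that maximizes the posterior is compared with a simple majority vote.

```python
from itertools import product


def likelihood(r, s, f):
    """P(r | s) for R3 over a binary symmetric channel, equations (1.21)-(1.22).

    For a repetition code every transmitted bit t_n(s) equals s.
    """
    p = 1.0
    for rn in r:
        p *= (1 - f) if rn == s else f
    return p


def posterior_decode(r, f, prior1=0.5):
    """Guess the s with the larger (unnormalized) posterior probability."""
    post1 = likelihood(r, 1, f) * prior1
    post0 = likelihood(r, 0, f) * (1 - prior1)
    return 1 if post1 > post0 else 0


def majority_vote(r):
    """Decode three received bits by majority vote."""
    return 1 if sum(r) >= 2 else 0


f = 0.1
for r in product((0, 1), repeat=3):
    ratio = likelihood(r, 1, f) / likelihood(r, 0, f)  # likelihood ratio (1.23)
    print(r, ratio, posterior_decode(r, f), majority_vote(r))
```

The `prior1` argument is included so that one can see the equivalence break down when the two source values are not equiprobable.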


We now apply the majority vote decoder to the received vector of figure 1.8.
The first three received bits are all 0, so we decode this triplet as a 0. In the
second triplet of figure 1.8, there are two 0s and one 1, so we decode this triplet
as a 0, which in this case corrects the error. Not all errors are corrected,
however. If we are unlucky and two errors fall in a single block, as in the fifth
triplet of figure 1.8, then the decoding rule gets the wrong answer, as shown
in figure 1.10.


s    0    0    1    0    1    1    0
t   000  000  111  000  111  111  000
n   000  001  000  000  101  000  000
r   000  001  111  000  010  111  000
ŝ    0    0    1    0    0    1    0

corrected errors:    second block (⋆)
undetected errors:   fifth block (⋆)

Exercise 1.2.[2, p.16] Show that the error probability is reduced by the use of
R3 by computing the error probability of this code for a binary symmetric
channel with noise level f.


The error probability is dominated by the probability that two bits in
a block of three are flipped, which scales as $f^2$. In the case of the binary
symmetric channel with f = 0.1, the R3 code has a probability of error, after
decoding, of $p_b \simeq 0.03$ per bit. Figure 1.11 shows the result of transmitting a
binary image over a binary symmetric channel using the repetition code.
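The figure of about 0.03 can be reproduced by counting the ways in which a majority vote over three bits goes wrong: exactly two flips (three possible positions for the unflipped bit) or all three flips. A minimal sketch, with a function name of my own:

```python
def r3_bit_error_probability(f):
    """Probability that majority-vote decoding of one R3 block is wrong."""
    two_flips = 3 * f**2 * (1 - f)   # exactly two of the three bits flipped
    three_flips = f**3               # all three bits flipped
    return two_flips + three_flips


print(r3_bit_error_probability(0.1))  # roughly 0.028, i.e. about 3%
```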


Figure 1.10. Decoding the received vector from figure 1.8.

[Margin note: The exercise’s rating, e.g. ‘[2]’, indicates its difficulty: ‘1’
exercises are the easiest. Exercises that are accompanied by a marginal rat
are especially recommended. If a solution or partial solution is provided, the
page is indicated after the difficulty rating; for example, this exercise’s
solution is on page 16.]

Figure 1.11. Transmitting 10 000 source bits over a binary symmetric channel
with f = 10% using a repetition code and the majority vote decoding algorithm.
The probability of decoded bit error has fallen to about 3%; the rate has
fallen to 1/3.
[Diagram: s → encoder → t → channel (f = 10%) → r → decoder → ŝ.]

Figure 1.12. Error probability $p_b$ versus rate for repetition codes over a
binary symmetric channel with f = 0.1. The right-hand figure shows $p_b$ on a
logarithmic scale. We would like the rate to be large and $p_b$ to be small.
[Two plots of $p_b$ against Rate for the codes R1, R3, R5, ..., R61; the region
marked ‘more useful codes’ is at high rate and low $p_b$.]



The repetition code R3 has therefore reduced the probability of error, as
desired. Yet we have lost something: our rate of information transfer has
fallen by a factor of three. So if we use a repetition code to communicate data
over a telephone line, it will reduce the error frequency, but it will also reduce
our communication rate. We will have to pay three times as much for each
phone call. Similarly, we would need three of the original noisy gigabyte disk
drives in order to create a one-gigabyte disk drive with $p_b = 0.03$.

Can we push the error probability lower, to the values required for a sellable
disk drive, $10^{-15}$? We could achieve lower error probabilities by using
repetition codes with more repetitions.


Exercise 1.3.[3, p.16] (a) Show that the probability of error of $\mathrm{R}_N$, the
repetition code with N repetitions, is

\[ p_b = \sum_{n=(N+1)/2}^{N} \binom{N}{n} f^n (1-f)^{N-n} , \tag{1.24} \]

for odd N.

(b) Assuming f = 0.1, which of the terms in this sum is the biggest?
How much bigger is it than the second-biggest term?

(c) Use Stirling’s approximation (p.2) to approximate the $\binom{N}{n}$ in the
largest term, and find, approximately, the probability of error of
the repetition code with N repetitions.

(d) Assuming f = 0.1, find how many repetitions are required to get
the probability of error down to $10^{-15}$. [Answer: about 60.]


So to build a single gigabyte disk drive with the required reliability from noisy
gigabyte drives with f = 0.1, we would need sixty of the noisy disk drives.
The tradeoff between error probability and rate for repetition codes is shown
in figure 1.12.
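The tradeoff can also be explored numerically by evaluating the sum (1.24) exactly rather than through Stirling’s approximation. The sketch below does this for odd N and searches for the smallest N that meets a target error probability; the function names are my own, and the search is a plain linear scan.

```python
from math import comb


def repetition_error_probability(N, f):
    """Equation (1.24): probability that more than half of N bits are flipped."""
    if N % 2 == 0:
        raise ValueError("N must be odd")
    return sum(comb(N, n) * f**n * (1 - f) ** (N - n)
               for n in range((N + 1) // 2, N + 1))


def repetitions_needed(target, f):
    """Smallest odd N whose error probability is at most `target`."""
    N = 1
    while repetition_error_probability(N, f) > target:
        N += 2
    return N


N_needed = repetitions_needed(1e-15, 0.1)
print(N_needed, repetition_error_probability(N_needed, 0.1))
for N in (1, 3, 5, 11, 21, 41, 61):
    print(N, 1 / N, repetition_error_probability(N, 0.1))  # N, rate, p_b
```

The result is in line with the ‘about 60’ quoted in the exercise, and the printed table shows how quickly the rate 1/N falls as $p_b$ is pushed down.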


Block codes — the (7,4) Hamming code 


We would like to communicate with tiny probability of error and at a substan- 
tial rate. Can we improve on repetition codes? What if we add redundancy to 
blocks of data instead of encoding one bit at a time? We now study a simple 


block code. 


A block code is a rule for converting a sequence of source bits s, of length 
K, say, into a transmitted sequence t of length N bits. To add redundancy, 
we make N greater than K. In a linear block code, the extra N — K bits are 
linear functions of the original K bits; these extra bits are called parity-check 
bits. An example of a linear block code is the (7,4) Hamming code, which 
transmits N = 7 bits for every K = 4 source bits. 


Figure 1.13. Pictorial representation of encoding for the (7,4) Hamming code.
[Panels (a) and (b): the seven bits arranged in three intersecting circles.]


The encoding operation for the code is shown pictorially in figure 1.13. We
arrange the seven transmitted bits in three intersecting circles. The first four
transmitted bits, $t_1 t_2 t_3 t_4$, are set equal to the four source bits, $s_1 s_2 s_3 s_4$. The
parity-check bits $t_5 t_6 t_7$ are set so that the parity within each circle is even:
the first parity-check bit is the parity of the first three source bits (that is, it
is 0 if the sum of those bits is even, and 1 if the sum is odd); the second is
the parity of the last three; and the third parity bit is the parity of source bits
one, three and four.

As an example, figure 1.13b shows the transmitted codeword for the case
s = 1000. Table 1.14 shows the codewords generated by each of the $2^4 =$
sixteen settings of the four source bits. These codewords have the special
property that any pair differ from each other in at least three bits.


Table 1.14. The sixteen codewords {t} of the (7,4) Hamming code. Any pair of
codewords differ from each other in at least three bits.

s     t          s     t          s     t          s     t
0000  0000000    0100  0100110    1000  1000101    1100  1100011
0001  0001011    0101  0101101    1001  1001110    1101  1101000
0010  0010111    0110  0110001    1010  1010010    1110  1110100
0011  0011100    0111  0111010    1011  1011001    1111  1111111
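The parity rules described above are enough to regenerate table 1.14 and to check its distance property. In this sketch the function names are illustrative; the three parity expressions follow the verbal description of the circles.

```python
from itertools import product


def hamming74_encode(s):
    """Encode four source bits into seven transmitted bits."""
    s1, s2, s3, s4 = s
    t5 = (s1 + s2 + s3) % 2   # parity of the first three source bits
    t6 = (s2 + s3 + s4) % 2   # parity of the last three source bits
    t7 = (s1 + s3 + s4) % 2   # parity of source bits one, three and four
    return (s1, s2, s3, s4, t5, t6, t7)


def hamming_distance(a, b):
    """Number of positions in which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


codewords = [hamming74_encode(s) for s in product((0, 1), repeat=4)]
min_distance = min(hamming_distance(a, b)
                   for i, a in enumerate(codewords)
                   for b in codewords[i + 1:])

for t in codewords:
    print("".join(map(str, t)))
print("minimum distance between codewords:", min_distance)
```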


Because the Hamming code is a linear code, it can be written compactly in 
terms of matrices as follows. The transmitted codeword t is obtained from the 
source sequence s by a linear operation, 


\[ \mathbf{t} = \mathbf{G}^{\mathsf{T}} \mathbf{s}, \tag{1.25} \]


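where $\mathbf{G}$ is the generator matrix of the code,

\[ \mathbf{G}^{\mathsf{T}} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 \\
0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1
\end{bmatrix} , \tag{1.26} \]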


and the encoding operation (1.25) uses modulo-2 arithmetic (1+1 = 0, 0+1 = 
1, etc.). 


In the encoding operation (1.25) I have assumed that s and t are column vectors.
If instead they are row vectors, then this equation is replaced by

\[ \mathbf{t} = \mathbf{s} \mathbf{G}, \tag{1.27} \]

where

\[ \mathbf{G} = \begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1
\end{bmatrix} . \tag{1.28} \]

I find it easier to relate to the right-multiplication (1.25) than the
left-multiplication (1.27). Many coding theory texts use the left-multiplying
conventions (1.27–1.28), however.


The rows of the generator matrix (1.28) can be viewed as defining four basis 
vectors lying in a seven-dimensional binary space. The sixteen codewords are 
obtained by making all possible linear combinations of these vectors. 
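The statement about linear combinations can be verified directly from (1.28). The sketch below forms t = sG in modulo-2 arithmetic for all sixteen row vectors s; the name `encode_with_generator` is mine, and plain tuples are used rather than a matrix library.

```python
from itertools import product

# Rows of the generator matrix G, equation (1.28).
G = [
    (1, 0, 0, 0, 1, 0, 1),
    (0, 1, 0, 0, 1, 1, 0),
    (0, 0, 1, 0, 1, 1, 1),
    (0, 0, 0, 1, 0, 1, 1),
]


def encode_with_generator(s):
    """Compute t = s G (mod 2), treating s and t as row vectors (1.27)."""
    return tuple(sum(s[k] * G[k][j] for k in range(4)) % 2 for j in range(7))


# Every choice of s selects a subset of the rows of G to add together.
codewords = {encode_with_generator(s) for s in product((0, 1), repeat=4)}
print(len(codewords))
print(encode_with_generator((1, 0, 0, 0)))
```

Each source vector with a single 1 simply picks out one row of G, which is why the rows act as basis vectors.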


Decoding the (7,4) Hamming code 


When we invent a more complex encoder s → t, the task of decoding the
received vector r becomes less straightforward. Remember that any of the 
bits may have been flipped, including the parity bits. 

If we assume that the channel is a binary symmetric channel and that all 
source vectors are equiprobable, then the optimal decoder identifies the source 
vector s whose encoding t(s) differs from the received vector r in the fewest 
bits. [Refer to the likelihood function (1.23) to see why this is so.] We could 
solve the decoding problem by measuring how far r is from each of the sixteen 
codewords in table 1.14, then picking the closest. Is there a more efficient way 
of finding the most probable source vector? 
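The brute-force approach just described, measuring the distance from r to each of the sixteen codewords, is short to write down. The sketch below is only that baseline; the encoder repeats the parity rules given earlier, and the received vector 1100101 (the codeword 1000101 with its second bit flipped) is used as an example input.

```python
from itertools import product


def encode(s):
    """(7,4) Hamming encoder: four source bits followed by three parity bits."""
    s1, s2, s3, s4 = s
    return (s1, s2, s3, s4,
            (s1 + s2 + s3) % 2, (s2 + s3 + s4) % 2, (s1 + s3 + s4) % 2)


# Lookup table from codeword t(s) back to the source vector s.
CODEBOOK = {encode(s): s for s in product((0, 1), repeat=4)}


def nearest_codeword_decode(r):
    """Return the s whose codeword differs from r in the fewest bits."""
    best = min(CODEBOOK, key=lambda t: sum(a != b for a, b in zip(t, r)))
    return CODEBOOK[best]


r = (1, 1, 0, 0, 1, 0, 1)
print(nearest_codeword_decode(r))
```

For a code this small the sixteen comparisons are cheap, but the number of codewords doubles with every extra source bit, which is the motivation for looking for something more efficient.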


Syndrome decoding for the Hamming code 


For the (7,4) Hamming code there is a pictorial solution to the decoding 
problem, based on the encoding picture, figure 1.13. 

As a first example, let’s assume the transmission was t = 1000101 and the
noise flips the second bit, so the received vector is r = 1000101 ⊕ 0100000 =
1100101. We write the received vector into the three circles as shown in
figure 1.15a, and look at each of the three circles to see whether its parity 
is even. The circles whose parity is not even are shown by dashed lines in 
figure 1.15b. The decoding task is to find the smallest set of flipped bits that 
can account for these violations of the parity rules. [The pattern of violations 
of the parity checks is called the syndrome, and can be written as a binary 
vector — for example, in figure 1.15b, the syndrome is z = (1,1,0), because 
the first two circles are ‘unhappy’ (parity 1) and the third circle is ‘happy’ 
(parity 0).] 

To solve the decoding task, we ask the question: can we find a unique bit
that lies inside all the ‘unhappy’ circles and outside all the ‘happy’ circles? If
so, the flipping of that bit would account for the observed syndrome. In the
case shown in figure 1.15b, the bit $r_2$ lies inside the two unhappy circles and
outside the happy circle; no other single bit has this property, so $r_2$ is the only
single bit capable of explaining the syndrome.

Let’s work through a couple more examples. Figure 1.15c shows what
happens if one of the parity bits, $t_5$, is flipped by the noise. Just one of the
checks is violated. Only $r_5$ lies inside this unhappy circle and outside the other
two happy circles, so $r_5$ is identified as the only single bit capable of explaining
the syndrome.

If the central bit $r_3$ is received flipped, figure 1.15d shows that all three
checks are violated; only $r_3$ lies inside all three circles, so $r_3$ is identified as
the suspect bit.


Syndrome z         000    001    010    011    100    101    110    111
Unflip this bit    none   r7     r6     r4     r5     r1     r2     r3


If you try flipping any one of the seven bits, you'll find that a different 
syndrome is obtained in each case — seven non-zero syndromes, one for each 
bit. There is only one other syndrome, the all-zero syndrome. So if the 
channel is a binary symmetric channel with a small noise level f, the optimal 
decoder unflips at most one bit, depending on the syndrome, as shown in 
algorithm 1.16. Each syndrome could have been caused by other noise patterns 
too, but any other noise pattern that has the same syndrome must be less 
probable because it involves a larger number of noise events. 

What happens if the noise actually flips more than one bit? Figure 1.15e 
shows the situation when two bits, r3 and r7, are received flipped. The syn- 
drome, 110, makes us suspect the single bit r2; so our optimal decoding al- 
gorithm flips this bit, giving a decoded pattern with three errors as shown 
in figure 1.15e’. If we use the optimal decoding algorithm, any two-bit error 
pattern will lead to a decoded seven-bit vector that contains three errors. 
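Both claims, that the seven single-bit flips give seven distinct non-zero syndromes and that every two-bit error pattern leaves three errors after decoding, can be checked by enumeration. The sketch below builds the syndrome table from the three parity circles; the names `CHECKS`, `SUSPECT` and `syndrome_decode` are my own, and bit positions are counted from 0 in the code (so index 1 is $r_2$).

```python
from itertools import combinations

# Each tuple lists the 0-based positions of the bits inside one parity circle,
# in the order of the parity bits r5, r6, r7.
CHECKS = [(0, 1, 2, 4), (1, 2, 3, 5), (0, 2, 3, 6)]


def syndrome(r):
    """Pattern of violated (1) and satisfied (0) parity checks."""
    return tuple(sum(r[i] for i in check) % 2 for check in CHECKS)


# Map each non-zero syndrome to the single bit whose flip would produce it.
SUSPECT = {}
for bit in range(7):
    e = [0] * 7
    e[bit] = 1
    SUSPECT[syndrome(e)] = bit


def syndrome_decode(r):
    """Unflip at most one bit, chosen according to the syndrome."""
    r = list(r)
    z = syndrome(r)
    if z in SUSPECT:
        r[SUSPECT[z]] ^= 1
    return r


t = [1, 0, 0, 0, 1, 0, 1]
error_counts = []
for i, j in combinations(range(7), 2):
    r = list(t)
    r[i] ^= 1
    r[j] ^= 1
    decoded = syndrome_decode(r)
    error_counts.append(sum(a != b for a, b in zip(t, decoded)))

print(sorted(SUSPECT.items()))
print(set(error_counts))  # number of wrong bits left after each two-bit error
```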


General view of decoding for linear codes: syndrome decoding 


We can also describe the decoding problem for a linear code in terms of matrices.
The first four received bits, $r_1 r_2 r_3 r_4$, purport to be the four source bits; and the
received bits $r_5 r_6 r_7$ purport to be the parities of the source bits, as defined by
the generator matrix G. We evaluate the three parity-check bits for the received
bits, $r_1 r_2 r_3 r_4$, and see whether they match the three received bits, $r_5 r_6 r_7$. The
differences (modulo 2) between these two triplets are called the syndrome of the
received vector. If the syndrome is zero (if all three parity checks are happy)
then the received vector is a codeword, and the most probable decoding is
given by reading out its first four bits. If the syndrome is non-zero, then the
noise sequence for this block was non-zero, and the syndrome is our pointer to
the most probable error pattern.

Figure 1.15. Pictorial representation of decoding of the Hamming (7,4) code.
The received vector is written into the diagram as shown in (a). In (b,c,d,e),
the received vector is shown, assuming that the transmitted vector was as in
figure 1.13b and the bits labelled by ⋆ were flipped. The violated parity checks
are highlighted by dashed circles. One of the seven bits is the most probable
suspect to account for each ‘syndrome’, i.e., each pattern of violated and
satisfied parity checks.
In examples (b), (c), and (d), the most probable suspect is the one bit that
was flipped.
In example (e), two bits have been flipped, $s_3$ and $t_7$. The most probable
suspect is $r_2$, marked by a circle in (e′), which shows the output of the
decoding algorithm.

Algorithm 1.16. Actions taken by the optimal decoder for the (7,4) Hamming
code, assuming a binary symmetric channel with small noise level f. The
syndrome vector z lists whether each parity check is violated (1) or satisfied
(0), going through the checks in the order of the bits $r_5$, $r_6$, and $r_7$.



The computation of the syndrome vector is a linear operation. If we define the
3 x 4 matrix P such that the matrix of equation (1.26) is

$$ G^T = \begin{bmatrix} I_4 \\ P \end{bmatrix}, \tag{1.29} $$

where $I_4$ is the 4 x 4 identity matrix, then the syndrome vector is z = Hr,
where the parity-check matrix H is given by $H = [\, -P \;\; I_3 \,]$; in modulo 2
arithmetic, $-1 \equiv 1$, so

$$ H = [\, P \;\; I_3 \,] = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}. \tag{1.30} $$

All the codewords $t = G^T s$ of the code satisfy

$$ H t = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}. \tag{1.31} $$


> Exercise 1.4. Prove that this is so by evaluating the 3 x 4 matrix $H G^T$.


Since the received vector r is given by $r = G^T s + n$, the syndrome-decoding
problem is to find the most probable noise vector n satisfying the equation 


Hn =z. (1.32) 


A decoding algorithm that solves this problem is called a maximum-likelihood 
decoder. We will discuss decoding problems like this in later chapters. 
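The matrix description above can be tried out numerically. The following Python sketch builds $G^T$ and H from the matrix P read off equation (1.30), checks that every codeword has zero syndrome, and illustrates that the syndrome of a received vector depends only on the noise. The helper names (`matvec_mod2`, `encode`, `syndrome`) and the particular source and noise vectors are illustrative assumptions, not taken from the text.

```python
from itertools import product

# P is the 3 x 4 block of the parity-check matrix H = [P I_3] of equation (1.30).
P = [[1, 1, 1, 0],
     [0, 1, 1, 1],
     [1, 0, 1, 1]]
I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
I3 = [[1 if i == j else 0 for j in range(3)] for i in range(3)]

G_T = I4 + P                              # 7 x 4 matrix: I_4 stacked on P
H = [P[i] + I3[i] for i in range(3)]      # 3 x 7 matrix: [P I_3]


def matvec_mod2(M, v):
    """Multiply matrix M by vector v in modulo-2 arithmetic."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]


def encode(s):
    """Codeword t = G^T s for a 4-bit source vector s."""
    return matvec_mod2(G_T, s)


def syndrome(r):
    """Syndrome z = H r of a 7-bit received vector r."""
    return matvec_mod2(H, r)


# Illustrative example: encode s, flip the second bit, and compute the syndrome.
s = [1, 0, 0, 0]
t = encode(s)
n = [0, 1, 0, 0, 0, 0, 0]
r = [(a + b) % 2 for a, b in zip(t, n)]
z = syndrome(r)

print("t =", t, " r =", r, " z =", z)
print("all codewords have zero syndrome:",
      all(syndrome(encode(list(src))) == [0, 0, 0]
          for src in product([0, 1], repeat=4)))
```

Because H t = 0 for every codeword, `syndrome(r)` equals `syndrome(n)`, which is why equation (1.32) involves only the noise vector.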


Summary of the (7,4) Hamming code’s properties 


Every possible received vector of length 7 bits is either a codeword, or it’s one 
flip away from a codeword. 

Since there are three parity constraints, each of which might or might not 
be violated, there are 2 x 2 x 2 = 8 distinct syndromes. They can be divided 
into seven non-zero syndromes — one for each of the one-bit error patterns — 
and the all-zero syndrome, corresponding to the zero-noise case. 

The optimal decoder takes no action if the syndrome is zero, otherwise it 
uses this mapping of non-zero syndromes onto one-bit error patterns to unflip 
the suspect bit. 
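A minimal sketch of this decoder, assuming the parity-check matrix of equation (1.30): the syndrome produced by a single flip of bit k is column k of H, so the seven columns supply the mapping from non-zero syndromes to suspect bits. The names `suspect` and `correct` are hypothetical, chosen for this illustration.

```python
from itertools import product

H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]


def syndrome(r):
    """Syndrome of a 7-bit vector, as a tuple of three bits."""
    return tuple(sum(h * x for h, x in zip(row, r)) % 2 for row in H)


# Column k of H is the syndrome caused by flipping bit k alone.
suspect = {tuple(row[k] for row in H): k for k in range(7)}


def correct(r):
    """Unflip the suspect bit (if the syndrome is non-zero) and return the result."""
    r = list(r)
    z = syndrome(r)
    if z in suspect:          # the all-zero syndrome is not a column of H
        r[suspect[z]] ^= 1
    return r


def decode(r):
    """Decoded source bits: the first four bits of the corrected vector."""
    return correct(r)[:4]


codewords = {r for r in product([0, 1], repeat=7) if syndrome(r) == (0, 0, 0)}
print(len(suspect), "non-zero syndromes,", len(codewords), "codewords")
print(all(tuple(correct(r)) in codewords for r in product([0, 1], repeat=7)))
```

The final check confirms the statement that opens this summary: every 7-bit vector is either a codeword or one flip away from one.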



Figure 1.17. Transmitting 10 000 
source bits over a binary 
symmetric channel with f = 10% 
using a (7,4) Hamming code. The 
probability of decoded bit error is 
about 7%. 


There is a decoding error if the four decoded bits $\hat{s}_1, \hat{s}_2, \hat{s}_3, \hat{s}_4$ do not all
match the source bits $s_1, s_2, s_3, s_4$. The probability of block error $p_B$ is the
probability that one or more of the decoded bits in one block fail to match the
corresponding source bits,

$$ p_B = P(\hat{\mathbf{s}} \neq \mathbf{s}). \tag{1.33} $$

The probability of bit error $p_b$ is the average probability that a decoded bit
fails to match the corresponding source bit,

$$ p_b = \frac{1}{K} \sum_{k=1}^{K} P(\hat{s}_k \neq s_k). \tag{1.34} $$


In the case of the Hamming code, a decoding error will occur whenever 
the noise has flipped more than one bit in a block of seven. The probability 
of block error is thus the probability that two or more bits are flipped in a 
block. This probability scales as $O(f^2)$, as did the probability of error for the
repetition code R3. But notice that the Hamming code communicates at a 
greater rate, R = 4/7. 
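To make the $O(f^2)$ scaling concrete, here is a short sketch that evaluates the probability of two or more flips in a block of seven and compares it with the leading-order term $21 f^2$. It takes as given the statement above that every pattern of two or more flips causes a block error; the function name is an assumption of this example.

```python
from math import comb


def hamming74_block_error(f):
    """P(two or more of the seven bits are flipped) on a binary symmetric channel."""
    return sum(comb(7, k) * f**k * (1 - f)**(7 - k) for k in range(2, 8))


for f in (0.1, 0.01, 0.001):
    exact = hamming74_block_error(f)
    print(f"f = {f}: p_B = {exact:.3e}, 21 f^2 = {21 * f**2:.3e}")
```

The two columns approach each other as f shrinks, as a leading-order approximation should.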

Figure 1.17 shows a binary image transmitted over a binary symmetric 
channel using the (7,4) Hamming code. About 7% of the decoded bits are 
in error. Notice that the errors are correlated: often two or three successive 
decoded bits are flipped. 


> Exercise 1.5. This exercise and the next three refer to the (7,4) Hamming
code. Decode the received strings:
(a) r = 1101011 
(b) r = 0110110 
(c) r = 0100111 
(d) r = 1111111. 


Exercise 1.6.[p.17] (a) Calculate the probability of block error $p_B$ of the
(7,4) Hamming code as a function of the noise level f and show
that to leading order it goes as $21 f^2$.

(b) [3] Show that to leading order the probability of bit error $p_b$ goes
as $9 f^2$.


> Exercise 1.7.[p.19] Find some noise vectors that give the all-zero syndrome
(that is, noise vectors that leave all the parity checks unviolated). How
many such noise vectors are there?


> Exercise 1.8. I asserted above that a block decoding error will result
whenever two or more bits are flipped in a single block. Show that this is
indeed so. [In principle, there might be error patterns that, after
decoding, led only to the corruption of the parity bits, with no source bits
incorrectly decoded.]


Summary of codes’ performances 


Figure 1.18 shows the performance of repetition codes and the Hamming code. 
It also shows the performance of a family of linear block codes that are gen- 
eralizations of Hamming codes, called BCH codes. 

This figure shows that we can, using linear block codes, achieve better
performance than repetition codes; but the asymptotic situation still looks
grim.

Exercise 1.9.[4, p.19] Design an error-correcting code and a decoding algorithm
for it, estimate its probability of error, and add it to figure 1.18. [Don’t
worry if you find it difficult to make a code better than the Hamming
code, or if you find it difficult to find a good decoder for your code; that’s
the point of this exercise.]

Exercise 1.10.[p.20] A (7,4) Hamming code can correct any one error; might
there be a (14,8) code that can correct any two errors?
Optional extra: Does the answer to this question depend on whether the
code is linear or nonlinear?

Exercise 1.11.[4, p.21] Design an error-correcting code, other than a repetition
code, that can correct any two errors in a block of size N.


> 1.3 What performance can the best codes achieve? 


There seems to be a trade-off between the decoded bit-error probability $p_b$
(which we would like to reduce) and the rate R (which we would like to keep
large). How can this trade-off be characterized? What points in the $(R, p_b)$
plane are achievable? This question was addressed by Claude Shannon in his
pioneering paper of 1948, in which he both created the field of information
theory and solved most of its fundamental problems.

At that time there was a widespread belief that the boundary between
achievable and nonachievable points in the $(R, p_b)$ plane was a curve passing
through the origin $(R, p_b) = (0, 0)$; if this were so, then, in order to achieve
a vanishingly small error probability $p_b$, one would have to reduce the rate
correspondingly close to zero. ‘No pain, no gain.’

However, Shannon proved the remarkable result that the boundary between
achievable and nonachievable points meets the R axis at a non-zero
value R = C, as shown in figure 1.19. For any channel, there exist codes that
make it possible to communicate with arbitrarily small probability of error $p_b$
at non-zero rates. The first half of this book (Parts I–III) will be devoted to
understanding this remarkable result, which is called the noisy-channel coding
theorem.


Example: f = 0.1

The maximum rate at which communication is possible with arbitrarily small
$p_b$ is called the capacity of the channel. The formula for the capacity of a
binary symmetric channel with noise level f is

$$ C(f) = 1 - H_2(f) = 1 - \left[ f \log_2 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f} \right]; \tag{1.35} $$

the channel we were discussing earlier with noise level f = 0.1 has capacity
$C \simeq 0.53$. Let us consider what this means in terms of noisy disk drives. The
repetition code $R_3$ could communicate over this channel with $p_b = 0.03$ at a
rate R = 1/3. Thus we know how to build a single gigabyte disk drive with
$p_b = 0.03$ from three noisy gigabyte disk drives. We also know how to make a
single gigabyte disk drive with $p_b \simeq 10^{-15}$ from sixty noisy one-gigabyte drives
(exercise 1.3, p.8). And now Shannon passes by, notices us juggling with disk
drives and codes and says:

    ‘What performance are you trying to achieve? $10^{-15}$? You don’t
    need sixty disk drives: you can get that performance with just
    two disk drives (since 1/2 is less than 0.53). And if you want
    $p_b = 10^{-18}$ or $10^{-24}$ or anything, you can get there with two disk
    drives too!’

(Strictly, the above statements might not be quite right, since, as we shall see,
Shannon proved his noisy-channel coding theorem by studying sequences of
block codes with ever-increasing blocklengths, and the required blocklength
might be bigger than a gigabyte (the size of our disk drive), in which case,
Shannon might say ‘well, you can’t do it with those tiny disk drives, but if you
had two noisy terabyte drives, you could make a single high-quality terabyte
drive from them’.)

Figure 1.19. Shannon’s … $H_2$ are defined in equation (1.35). [Horizontal axis: Rate.]
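As a quick numerical check of equation (1.35), the following sketch evaluates the binary entropy function and the capacity of the binary symmetric channel. The function names are chosen for this example only, and the convention that the entropy is zero at f = 0 and f = 1 is the usual limiting value.

```python
from math import log2


def binary_entropy(f):
    """H_2(f) = f log2(1/f) + (1 - f) log2(1/(1 - f)), taken as 0 at f = 0 or 1."""
    if f in (0.0, 1.0):
        return 0.0
    return f * log2(1 / f) + (1 - f) * log2(1 / (1 - f))


def bsc_capacity(f):
    """Capacity C(f) = 1 - H_2(f) of a binary symmetric channel with noise level f."""
    return 1.0 - binary_entropy(f)


print(f"C(0.1) = {bsc_capacity(0.1):.3f}")
print("rate 1/2 is below capacity:", 0.5 < bsc_capacity(0.1))
```

The second line is the comparison Shannon makes in the quotation: a rate of 1/2, that is, two noisy drives per reliable drive, lies below the capacity of this channel.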


> 1.4 Summary 


The (7,4) Hamming Code 


By including three parity-check bits in a block of 7 bits it is possible to detect 
and correct any single bit error in each block. 


Shannon’s noisy-channel coding theorem 


Information can be communicated over a noisy channel at a non-zero rate with
arbitrarily small error probability.

Information theory addresses both the limitations and the possibilities of
communication. The noisy-channel coding theorem, which we will prove in 
Chapter 10, asserts both that reliable communication at any rate beyond the 
capacity is impossible, and that reliable communication at all rates up to 
capacity is possible. 

The next few chapters lay the foundations for this result by discussing 
how to measure information content and the intimately related topic of data 
compression. 


> 1.5 Further exercises 


> Exercise 1.12.[p.21] Consider the repetition code $R_9$. One way of viewing
this code is as a concatenation of $R_3$ with $R_3$. We first encode the
source stream with $R_3$, then encode the resulting output with $R_3$. We
could call this code ‘$R_3^2$’. This idea motivates an alternative decoding
algorithm, in which we decode the bits three at a time using the decoder
for $R_3$; then decode the decoded bits from that first decoder using the
decoder for $R_3$.

Evaluate the probability of error for this decoder and compare it with
the probability of error for the optimal decoder for $R_9$.

Do the concatenated encoder and decoder for $R_3^2$ have advantages over
those for $R_9$?


> 1.6 Solutions 


Solution to exercise 1.2 (p.7). An error is made by $R_3$ if two or more bits are
flipped in a block of three. So the error probability of $R_3$ is a sum of two
terms: the probability that all three bits are flipped, $f^3$; and the probability
that exactly two bits are flipped, $3f^2(1-f)$. [If these expressions are not
obvious, see example 1.1 (p.1): the expressions are $P(r=3 \mid f, N=3)$ and
$P(r=2 \mid f, N=3)$.]

$$ p_b = p_B = 3f^2(1-f) + f^3 = 3f^2 - 2f^3. \tag{1.36} $$

This probability is dominated for small f by the term $3f^2$.
See exercise 2.38 (p.39) for further discussion of this problem.
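The same sum can be written for any odd blocklength N. This sketch, whose function name is an assumption of the example, evaluates the exact majority-vote error probability and confirms that it reduces to equation (1.36) when N = 3.

```python
from math import comb


def repetition_error(N, f):
    """Exact P(majority vote is wrong) for the repetition code R_N, N odd."""
    return sum(comb(N, k) * f**k * (1 - f)**(N - k)
               for k in range((N + 1) // 2, N + 1))


f = 0.1
print(repetition_error(3, f), 3 * f**2 - 2 * f**3)   # the two expressions agree
```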


Solution to exercise 1.3 (p.8). The probability of error for the repetition code
$R_N$ is dominated by the probability that $\lceil N/2 \rceil$ bits are flipped, which goes
(for odd N) as

$$ \binom{N}{\lceil N/2 \rceil} f^{(N+1)/2} (1-f)^{(N-1)/2}. \tag{1.37} $$

[Notation: $\lceil N/2 \rceil$ denotes the smallest integer greater than or equal to N/2.]

The term $\binom{N}{K}$ can be approximated using the binary entropy function:

$$ \frac{1}{N+1} 2^{N H_2(K/N)} \le \binom{N}{K} \le 2^{N H_2(K/N)} \;\Rightarrow\; \binom{N}{K} \simeq 2^{N H_2(K/N)}, \tag{1.38} $$

where this approximation introduces an error of order $\sqrt{N}$, as shown in
equation (1.17). So

$$ p_b = p_B \simeq 2^N (f(1-f))^{N/2} = (4f(1-f))^{N/2}. \tag{1.39} $$

Setting this equal to the required value of $10^{-15}$ we find
$N \simeq 2 \frac{\log 10^{-15}}{\log 4f(1-f)} = 68$.
This answer is a little out because the approximation we used overestimated
$\binom{N}{K}$ and we did not distinguish between $\lceil N/2 \rceil$ and N/2.


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 


You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.




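A slightly more careful answer (short of explicit computation) goes as follows.
Taking the approximation for $\binom{N}{K}$ to the next order, we find:

$$ \binom{N}{N/2} \simeq 2^N \frac{1}{\sqrt{2\pi N/4}}. \tag{1.40} $$

This approximation can be proved from an accurate version of Stirling’s
approximation (1.12), or by considering the binomial distribution with p = 1/2
and noting

$$ 1 = \sum_K \binom{N}{K} 2^{-N} \simeq 2^{-N} \binom{N}{N/2} \sum_{r=-N/2}^{N/2} e^{-r^2/2\sigma^2} \simeq 2^{-N} \binom{N}{N/2} \sqrt{2\pi}\,\sigma, \tag{1.41} $$

where $\sigma = \sqrt{N/4}$, from which equation (1.40) follows. The distinction between
$\lceil N/2 \rceil$ and N/2 is not important in this term since $\binom{N}{K}$ has a maximum at
K = N/2.

Then the probability of error (for odd N) is to leading order

$$ p_b \simeq \binom{N}{(N+1)/2} f^{(N+1)/2} (1-f)^{(N-1)/2} \tag{1.42} $$

$$ \simeq 2^N \frac{1}{\sqrt{\pi N/2}} f [f(1-f)]^{(N-1)/2} \simeq \frac{1}{\sqrt{\pi N/8}} f [4f(1-f)]^{(N-1)/2}. \tag{1.43} $$

The equation $p_b = 10^{-15}$ can be written

$$ (N-1)/2 \simeq \frac{\log 10^{-15} + \log \frac{\sqrt{\pi N/8}}{f}}{\log 4f(1-f)}, \tag{1.44} $$

which may be solved for N iteratively, the first iteration starting from $\hat{N}_1 = 68$:

$$ (\hat{N}_2 - 1)/2 \simeq \frac{-15 + 1.7}{-0.44} = 29.9 \;\Rightarrow\; \hat{N}_2 \simeq 60.9. \tag{1.45} $$

This answer is found to be stable, so $N \simeq 61$ is the blocklength at which
$p_b \simeq 10^{-15}$.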
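The iteration of equation (1.44) is easy to reproduce. The sketch below uses base-10 logarithms and starts from the first estimate of 68; the function name and the choice of five iterations are assumptions of this example, not part of the solution.

```python
from math import log10, pi, sqrt


def refine_blocklength(N, f=0.1, target=1e-15):
    """Apply equation (1.44) once: given an estimate N, return the next estimate."""
    half = (log10(target) + log10(sqrt(pi * N / 8) / f)) / log10(4 * f * (1 - f))
    return 2 * half + 1


N = 68.0                      # first estimate, from equation (1.39)
for step in range(5):
    N = refine_blocklength(N)
    print(f"iteration {step + 1}: N = {N:.2f}")
```

The estimate barely moves after the first step, which is the stability referred to above.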


[In equation (1.44), the logarithms can be taken to any base, as long
as it’s the same base throughout. In equation (1.45), I use base 10.]

Solution to exercise 1.6 (p.13).

(a) The probability of block error of the Hamming code is a sum of six terms:
the probabilities that 2, 3, 4, 5, 6, or 7 errors occur in one block.

$$ p_B = \sum_{r=2}^{7} \binom{7}{r} f^r (1-f)^{7-r}. \tag{1.46} $$

To leading order, this goes as

$$ p_B \simeq \binom{7}{2} f^2 = 21 f^2. \tag{1.47} $$

(b) The probability of bit error of the Hamming code is smaller than the
probability of block error because a block error rarely corrupts all bits in
the decoded block. The leading-order behaviour is found by considering
the outcome in the most probable case where the noise vector has weight
two. The decoder will erroneously flip a third bit, so that the modified
received vector (of length 7) differs in three bits from the transmitted
vector. That means, if we average over all seven bits, the probability that
a randomly chosen bit is flipped is 3/7 times the block error probability,
to leading order. Now, what we really care about is the probability that
a source bit is flipped. Are parity bits or source bits more likely to be
among these three flipped bits, or are all seven bits equally likely to be
corrupted when the noise vector has weight two? The Hamming code
is in fact completely symmetric in the protection it affords to the seven
bits (assuming a binary symmetric channel). [This symmetry can be
proved by showing that the role of a parity bit can be exchanged with
a source bit and the resulting code is still a (7,4) Hamming code; see
below.] The probability that any one bit ends up corrupted is the same
for all seven bits. So the probability of bit error (for the source bits) is
simply three sevenths of the probability of block error.

$$ p_b \simeq \frac{3}{7} p_B \simeq 9 f^2. \tag{1.48} $$
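The leading coefficient in equation (1.48) can also be checked by enumeration. The sketch below runs a syndrome decoder built from the H of equation (1.30) on all 21 noise vectors of weight two and counts how many source bits end up wrong. Because the code is linear and the channel symmetric, it assumes without loss of generality that the all-zero codeword was transmitted; the names are illustrative.

```python
from itertools import combinations

H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

# Column k of H is the syndrome of a single flip of bit k.
column_to_bit = {tuple(row[k] for row in H): k for k in range(7)}


def residual_error(n):
    """Error pattern left after syndrome decoding when the all-zero codeword was sent."""
    z = tuple(sum(h * x for h, x in zip(row, n)) % 2 for row in H)
    e = list(n)
    if z in column_to_bit:
        e[column_to_bit[z]] ^= 1
    return e


wrong_source_bits = 0
for i, j in combinations(range(7), 2):
    n = [0] * 7
    n[i] = n[j] = 1
    wrong_source_bits += sum(residual_error(n)[:4])

# Each weight-two pattern has probability about f^2; average over the 4 source bits.
coefficient = wrong_source_bits / 4
print("leading-order p_b is about", coefficient, "* f^2")
```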


Symmetry of the Hamming (7,4) code 


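To prove that the (7,4) code protects all bits equally, we start from the
parity-check matrix

$$ H = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{bmatrix}. \tag{1.49} $$

The symmetry among the seven transmitted bits will be easiest to see if we
reorder the seven bits using the permutation
$(t_1 t_2 t_3 t_4 t_5 t_6 t_7) \to (t_5 t_2 t_3 t_4 t_1 t_6 t_7)$.
Then we can rewrite H thus:

$$ H = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \end{bmatrix}. \tag{1.50} $$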


Now, if we take any two parity constraints that t satisfies and add them
together, we get another parity constraint. For example, row 1 asserts
$t_5 + t_2 + t_3 + t_1$ = even, and row 2 asserts $t_2 + t_3 + t_4 + t_6$ = even, and the sum of
these two constraints is

$$ t_5 + 2t_2 + 2t_3 + t_1 + t_4 + t_6 = \text{even}; \tag{1.51} $$

we can drop the terms $2t_2$ and $2t_3$, since they are even whatever $t_2$ and $t_3$ are;
thus we have derived the parity constraint $t_5 + t_1 + t_4 + t_6$ = even, which we
can if we wish add into the parity-check matrix as a fourth row. [The set of
vectors satisfying Ht = 0 will not be changed.] We thus define

$$ H' = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \end{bmatrix}. \tag{1.52} $$


The fourth row is the sum (modulo two) of the top two rows. Notice that the
second, third, and fourth rows are all cyclic shifts of the top row. If, having
added the fourth redundant constraint, we drop the first constraint, we obtain
a new parity-check matrix H″,

$$ H'' = \begin{bmatrix} 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \end{bmatrix}, \tag{1.53} $$

which still satisfies H″t = 0 for all codewords, and which looks just like
the starting H in (1.50), except that all the columns have shifted along one
to the right, and the rightmost column has reappeared at the left (a cyclic
permutation of the columns).

This establishes the symmetry among the seven bits. Iterating the above 
procedure five more times, we can make a total of seven different H matrices 
for the same original code, each of which assigns each bit to a different role. 

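We may also construct the super-redundant seven-row parity-check matrix
for the code,

$$ H''' = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix}. \tag{1.54} $$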


This matrix is ‘redundant’ in the sense that the space spanned by its rows is 
only three-dimensional, not seven. 

This matrix is also a cyclic matrix. Every row is a cyclic permutation of 
the top row. 


Cyclic codes: if there is an ordering of the bits $t_1 \ldots t_N$ such that a linear
code has a cyclic parity-check matrix, then the code is called a cyclic
code.


The codewords of such a code also have cyclic properties: any cyclic 
permutation of a codeword is a codeword. 


For example, the Hamming (7,4) code, with its bits ordered as above, 
consists of all seven cyclic shifts of the codewords 1110100 and 1011000, 
and the codewords 0000000 and 1111111. 
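This description of the codewords can be verified by brute force. The sketch below lists every 7-bit vector satisfying the reordered parity-check matrix of equation (1.50) and compares the result with the cyclic shifts named above; the helper names are assumptions of the example.

```python
from itertools import product

# Parity-check matrix for the reordered bits, equation (1.50).
H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [0, 0, 1, 1, 1, 0, 1]]


def is_codeword(t):
    """True if every parity check of H is satisfied by t."""
    return all(sum(h * x for h, x in zip(row, t)) % 2 == 0 for row in H)


def cyclic_shifts(word):
    """The set of all cyclic shifts of a bit string such as '1110100'."""
    w = tuple(int(c) for c in word)
    return {w[k:] + w[:k] for k in range(len(w))}


codewords = {t for t in product([0, 1], repeat=7) if is_codeword(t)}
listed = (cyclic_shifts("1110100") | cyclic_shifts("1011000")
          | {(0,) * 7, (1,) * 7})

print(len(codewords), codewords == listed)
```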


Cyclic codes are a cornerstone of the algebraic approach to error-correcting
codes. We won’t use them again in this book, however, as they have been
superseded by sparse-graph codes (Part VI).


Solution to exercise 1.7 (p.13). There are fifteen non-zero noise vectors which 
give the all-zero syndrome; these are precisely the fifteen non-zero codewords 
of the Hamming code. Notice that because the Hamming code is linear, the 
sum of any two codewords is a codeword. 


Graphs corresponding to codes 


Solution to exercise 1.9 (p.14). When answering this question, you will prob- 
ably find that it is easier to invent new codes than to find optimal decoders 
for them. There are many ways to design codes, and what follows is just one 
possible train of thought. We make a linear block code that is similar to the 
(7,4) Hamming code, but bigger. 

Many codes can be conveniently expressed in terms of graphs. In figure 1.13,
we introduced a pictorial representation of the (7,4) Hamming code.
If we replace that figure’s big circles, each of which shows that the parity of
four particular bits is even, by a ‘parity-check node’ that is connected to the
four bits, then we obtain the representation of the (7,4) Hamming code by a
bipartite graph as shown in figure 1.20. The 7 circles are the 7 transmitted
bits. The 3 squares are the parity-check nodes (not to be confused with the
3 parity-check bits, which are the three most peripheral circles). The graph
is a ‘bipartite’ graph because its nodes fall into two classes, bits and checks,
and there are edges only between nodes in different classes. The graph and
the code’s parity-check matrix (1.30) are simply related to each other: each
parity-check node corresponds to a row of H and each bit node corresponds to
a column of H; for every 1 in H, there is an edge between the corresponding
pair of nodes.

Figure 1.20. The graph of the (7,4) Hamming code. The 7 circles are the bit
nodes and the 3 squares are the parity-check nodes.
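The correspondence between H and the graph takes only a few lines to state in code. In this sketch, which uses the H of equation (1.30), check nodes and bit nodes are simply numbered from zero; that numbering is an assumption made for the illustration.

```python
H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]

# One edge (check m, bit n) for every 1 in H.
edges = [(m, n) for m, row in enumerate(H) for n, h in enumerate(row) if h]

check_degree = [sum(row) for row in H]                          # row weights
bit_degree = [sum(row[n] for row in H) for n in range(7)]       # column weights

print(len(edges), "edges")
print("check degrees:", check_degree)
print("bit degrees:  ", bit_degree)
```

Each check node touches four bits, matching the four bits inside each big circle of figure 1.13.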

Having noticed this connection between linear codes and graphs, one way 
to invent linear codes is simply to think of a bipartite graph. For example, 
a pretty bipartite graph can be obtained from a dodecahedron by calling the 
vertices of the dodecahedron the parity-check nodes, and putting a transmitted 
bit on each edge in the dodecahedron. This construction defines a parity- 
check matrix in which every column has weight 2 and every row has weight 3. 
[The weight of a binary vector is the number of 1s it contains.] 

This code has N = 30 bits, and it appears to have $M_{\text{apparent}} = 20$ parity-check
constraints. Actually, there are only M = 19 independent constraints;
the 20th constraint is redundant (that is, if 19 constraints are satisfied, then
the 20th is automatically satisfied); so the number of source bits is
K = N − M = 11. The code is a (30, 11) code.
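These counts can be checked numerically. The sketch below enters the dodecahedron's graph as the generalized Petersen graph GP(10, 2), a standard description of the same graph (this labelling of vertices is an assumption of the example, not something used in the text). It computes the rank of the parity-check matrix over modulo-2 arithmetic and counts the codewords of weight up to five by brute force.

```python
from itertools import combinations

# Vertices 0-9 form an outer ring and vertices 10-19 an inner ring (assumed labelling).
edges = []
for i in range(10):
    edges.append((i, (i + 1) % 10))              # outer ring edge
    edges.append((i, 10 + i))                    # spoke
    edges.append((10 + i, 10 + (i + 2) % 10))    # inner edge, step 2

# One transmitted bit per edge; it takes part in the checks at its two end vertices.
# Column n of H is stored as a 20-bit integer mask.
columns = [(1 << a) | (1 << b) for a, b in edges]


def rank_gf2(vectors):
    """Rank over GF(2) of a list of integer bitmasks, by Gaussian elimination."""
    pivots = {}                      # leading-bit position -> reduced vector
    for v in vectors:
        while v:
            top = v.bit_length() - 1
            if top not in pivots:
                pivots[top] = v
                break
            v ^= pivots[top]
    return len(pivots)


def codewords_of_weight(w):
    """Number of sets of w bits whose flipping leaves every check satisfied."""
    count = 0
    for subset in combinations(columns, w):
        acc = 0
        for c in subset:
            acc ^= c
        count += (acc == 0)
    return count


N = len(columns)
M = rank_gf2(columns)
K = N - M
low_weight_counts = {w: codewords_of_weight(w) for w in range(1, 6)}
print(N, M, K, low_weight_counts)
```

The weight-five count corresponds to flipping the five bits around one face of the dodecahedron.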

It is hard to find a decoding algorithm for this code, but we can estimate
its probability of error by finding its lowest-weight codewords. If we flip all
the bits surrounding one face of the original dodecahedron, then all the parity
checks will be satisfied; so the code has 12 codewords of weight 5, one for each
face. Since the lowest-weight codewords have weight 5, we say that the code
has distance d = 5; the (7,4) Hamming code had distance 3 and could correct
all single bit-flip errors. A code with distance 5 can correct all double bit-flip
errors, but there are some triple bit-flip errors that it cannot correct. So the
error probability of this code, assuming a binary symmetric channel, will be
dominated, at least for low noise levels f, by a term of order $f^3$, perhaps
something like

$$ 12 \binom{5}{3} f^3 (1-f)^{27}. \tag{1.55} $$

Of course, there is no obligation to make codes whose graphs can be
represented on a plane, as this one can; the best linear codes, which have simple
graphical descriptions, have graphs that are more tangled, as illustrated by
the tiny (16,4) code of figure 1.22.

Furthermore, there is no reason for sticking to linear codes; indeed some
nonlinear codes (codes whose codewords cannot be defined by a linear equation
like Ht = 0) have very good properties. But the encoding and decoding
of a nonlinear code are even trickier tasks.


Figure 1.21. The graph defining the (30,11) dodecahedron code. The circles are
the 30 transmitted bits and the triangles are the 20 parity checks. One parity
check is redundant.

Figure 1.22. Graph of a rate-1/4 low-density parity-check code (Gallager code)
with blocklength N = 16, and M = 12 parity-check constraints. Each white circle
represents a transmitted bit. Each bit participates in j = 3 constraints,
represented by squares. The edges between nodes were placed at random. (See
Chapter 47 for more.)

Solution to exercise 1.10 (p.14). First let’s assume we are making a linear
code and decoding it with syndrome decoding. If there are N transmitted
bits, then the number of possible error patterns of weight up to two is

$$ \binom{N}{2} + \binom{N}{1} + \binom{N}{0}. \tag{1.56} $$

For N = 14, that’s 91 + 14 + 1 = 106 patterns. Now, every distinguishable
error pattern must give rise to a distinct syndrome; and the syndrome is a
list of M bits, so the maximum possible number of syndromes is $2^M$. For a
(14,8) code, M = 6, so there are at most $2^6 = 64$ syndromes. The number of
possible error patterns of weight up to two, 106, is bigger than the number of
syndromes, 64, so we can immediately rule out the possibility that there is a
(14,8) code that is 2-error-correcting.

The same counting argument works fine for nonlinear codes too. When
the decoder receives r = t + n, his aim is to deduce both t and n from r. If
it is the case that the sender can select any transmission t from a code of size
$S_t$, and the channel can select any noise vector from a set of size $S_n$, and those
two selections can be recovered from the received bit string r, which is one of
at most $2^N$ possible strings, then it must be the case that

$$ S_t S_n \le 2^N. \tag{1.57} $$

So, for a (N, K) two-error-correcting code, whether linear or nonlinear,

$$ 2^K \left[ \binom{N}{2} + \binom{N}{1} + \binom{N}{0} \right] \le 2^N. \tag{1.58} $$
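The counting condition of equations (1.57) and (1.58) is simple to evaluate for any proposed (N, K) code. The sketch below generalizes it to t errors; the function names are assumptions of the example. Passing the test is necessary for a t-error-correcting code to exist, but it does not by itself show that one does.

```python
from math import comb


def error_patterns(N, t):
    """Number of noise vectors of length N with weight at most t."""
    return sum(comb(N, w) for w in range(t + 1))


def satisfies_counting_bound(N, K, t):
    """Necessary condition S_t * S_n <= 2^N for an (N, K) t-error-correcting code."""
    return (2 ** K) * error_patterns(N, t) <= 2 ** N


print(error_patterns(14, 2))                 # weight-up-to-two patterns for N = 14
print(satisfies_counting_bound(14, 8, 2))    # the (14,8) two-error question
print(satisfies_counting_bound(7, 4, 1))     # the (7,4) Hamming code, one error
```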


Solution to exercise 1.11 (p.14). There are various strategies for making codes 
that can correct multiple errors, and I strongly recommend you think out one 
or two of them for yourself. 

If your approach uses a linear code, e.g., one with a collection of M parity 
checks, it is helpful to bear in mind the counting argument given in the previous 
exercise, in order to anticipate how many parity checks, M, you might need. 

Examples of codes that can correct any two errors are the (30, 11) dodeca- 
hedron code on page 20, and the (15,6) pentagonful code to be introduced on 
p.221. Further simple ideas for making codes that can correct multiple errors 
from codes that can correct only one error are discussed in section 13.7. 


Solution to exercise 1.12 (p.16). The probability of error of $R_3^2$ is, to leading
order,

$$ p_b(R_3^2) \simeq 3 [p_b(R_3)]^2 = 3(3f^2)^2 + \cdots = 27 f^4 + \cdots, \tag{1.59} $$

whereas the probability of error of $R_9$ is dominated by the probability of five
flips,

$$ p_b(R_9) \simeq \binom{9}{5} f^5 (1-f)^4 \simeq 126 f^5 + \cdots. \tag{1.60} $$

The $R_3^2$ decoding procedure is therefore suboptimal, since there are noise
vectors of weight four that cause it to make a decoding error.

It has the advantage, however, of requiring smaller computational re- 
sources: only memorization of three bits, and counting up to three, rather 
than counting up to nine. 

This simple code illustrates an important concept. Concatenated codes 
are widely used in practice because concatenation allows large codes to be 
implemented using simple encoding and decoding hardware. Some of the best 
known practical codes are concatenated codes. 
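The comparison in equations (1.59) and (1.60) can be made with exact expressions. In this sketch the two-stage decoder is modelled by applying the $R_3$ majority-vote error formula twice, which treats the three inner blocks as independent, as they are on a memoryless channel. The function names and the noise level used are illustrative.

```python
from math import comb


def majority_error(N, f):
    """Exact error probability of majority-vote decoding of R_N (N odd)."""
    return sum(comb(N, k) * f**k * (1 - f)**(N - k)
               for k in range((N + 1) // 2, N + 1))


def concatenated_r3_error(f):
    """Decode three bits at a time with R_3, then decode the three results with R_3."""
    inner = majority_error(3, f)
    return majority_error(3, inner)


f = 0.01
print(f"R_3^2 decoder: {concatenated_r3_error(f):.3e}  (27 f^4 = {27 * f**4:.3e})")
print(f"optimal R_9:   {majority_error(9, f):.3e}  (126 f^5 = {126 * f**5:.3e})")
```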
Probability, Entropy, and Inference


This chapter, and its sibling, Chapter 8, devote some time to notation. Just
as the White Knight distinguished between the song, the name of the song,
and what the name of the song was called (Carroll, 1998), we will sometimes
need to be careful to distinguish between a random variable, the value of the
random variable, and the proposition that asserts that the random variable
has a particular value. In any particular chapter, however, I will use the most
simple and friendly notation possible, at the risk of upsetting pure-minded
readers. For example, if something is ‘true with probability 1’, I will usually
simply say that it is ‘true’.

> 2.1 Probabilities and ensembles

An ensemble X is a triple $(x, A_X, P_X)$, where the outcome x is the value
of a random variable, which takes on one of a set of possible values,
$A_X = \{a_1, a_2, \ldots, a_i, \ldots, a_I\}$, having probabilities $P_X = \{p_1, p_2, \ldots, p_I\}$,
with $P(x=a_i) = p_i$, $p_i \ge 0$ and $\sum_{a_i \in A_X} P(x=a_i) = 1$.

The name A is mnemonic for ‘alphabet’. One example of an ensemble is a
letter that is randomly selected from an English document. This ensemble is
shown in figure 2.1. There are twenty-seven possible letters: a–z, and a space
character ‘-’.

Figure 2.1. Probability distribution over the 27 outcomes for a randomly
selected letter in an English language document (estimated from The Frequently
Asked Questions Manual for Linux). The picture shows the probabilities by the
areas of white squares.

     i   a_i   p_i
     1   a     0.0575
     2   b     0.0128
     3   c     0.0263
     4   d     0.0285
     5   e     0.0913
     6   f     0.0173
     7   g     […]
     8   h     0.0313
     9   i     0.0599
    10   j     0.0006
    11   k     0.0084
    12   l     0.0335
    13   m     0.0235
    14   n     0.0596
    15   o     0.0689
    16   p     0.0192
    17   q     0.0008
    18   r     0.0508
    19   s     0.0567
    20   t     0.0706
    21   u     0.0334
    22   v     0.0069
    23   w     0.0119
    24   x     0.0073
    25   y     0.0164
    26   z     […]
    27   -     0.1928

Abbreviations. Briefer notation will sometimes be used. For example,
$P(x=a_i)$ may be written as $P(a_i)$ or P(x).

Probability of a subset. If T is a subset of $A_X$ then:

$$ P(T) = P(x \in T) = \sum_{a_i \in T} P(x=a_i). \tag{2.1} $$

For example, if we define V to be vowels from figure 2.1, V =
{a,e,i,o,u}, then

$$ P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31. \tag{2.2} $$

A joint ensemble XY is an ensemble in which each outcome is an ordered
pair x,y with $x \in A_X = \{a_1, \ldots, a_I\}$ and $y \in A_Y = \{b_1, \ldots, b_J\}$.

We call P(x,y) the joint probability of x and y.

Commas are optional when writing ordered pairs, so $xy \Leftrightarrow x,y$.

N.B. In a joint ensemble XY the two variables are not necessarily
independent.
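Equation (2.1) amounts to a single sum, sketched here for the vowel subset of equation (2.2) using the unrounded entries of figure 2.1. The value used for u is read as 0.0334, the reading consistent with the rounded 0.03 in equation (2.2); that reading, like the function name, is an assumption of this example.

```python
# Probabilities of the five vowels, taken from figure 2.1 (u read as 0.0334).
p = {"a": 0.0575, "e": 0.0913, "i": 0.0599, "o": 0.0689, "u": 0.0334}


def subset_probability(T, p):
    """P(T) = sum of P(x = a_i) over the outcomes a_i in the subset T."""
    return sum(p[a] for a in T)


V = {"a", "e", "i", "o", "u"}
print(f"P(V) = {subset_probability(V, p):.2f}")
```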


Figure 2.2. The probability distribution over the 27x27 possible bigrams xy in
an English language document, The Frequently Asked Questions Manual for Linux.

Marginal probability. We can obtain the marginal probability P(x) from
the joint probability P(x,y) by summation:

$$ P(x=a_i) = \sum_{y \in A_Y} P(x=a_i, y). \tag{2.3} $$

Similarly, using briefer notation, the marginal probability of y is:

$$ P(y) = \sum_{x \in A_X} P(x, y). \tag{2.4} $$

Conditional probability

$$ P(x=a_i \mid y=b_j) = \frac{P(x=a_i, y=b_j)}{P(y=b_j)} \quad \text{if } P(y=b_j) \neq 0. \tag{2.5} $$

[If $P(y=b_j) = 0$ then $P(x=a_i \mid y=b_j)$ is undefined.]

We pronounce $P(x=a_i \mid y=b_j)$ ‘the probability that x equals $a_i$, given
y equals $b_j$’.


Example 2.1. An example of a joint ensemble is the ordered pair XY consisting 
of two successive letters in an English document. The possible outcomes 
are ordered pairs such as aa, ab, ac, and zz; of these, we might expect 
ab and ac to be more probable than aa and zz. An estimate of the 
joint probability distribution for two neighbouring characters is shown 
graphically in figure 2.2. 


This joint ensemble has the special property that its two marginal dis- 
tributions, P(x) and P(y), are identical. They are both equal to the 
monogram distribution shown in figure 2.1. 


From this joint ensemble P(x,y) we can obtain conditional distributions, 
P(y|x) and P(x |y), by normalizing the rows and columns, respectively 
(figure 2.3). The probability P(y|x=q) is the probability distribution 
of the second letter given that the first letter is a q. As you can see in 
figure 2.3a, the two most probable values for the second letter y given 
that the first letter x is q are u and -. (The space is common after q
because the source document makes heavy use of the word FAQ.)

The probability P(x | y=u) is the probability distribution of the first
letter x given that the second letter y is a u. As you can see in figure 2.3b
the two most probable values for x given y=u are n and o.

Figure 2.3. Conditional probability distributions. (a) P(y | x): Each row shows
the conditional distribution of the second letter, y, given the first letter, x,
in a bigram xy. (b) P(x | y): Each column shows the conditional distribution of
the first letter, x, given the second letter, y.


Rather than writing down the joint probability directly, we often define an 
ensemble in terms of a collection of conditional probabilities. The following 
rules of probability theory will be useful. (H denotes assumptions on which 
the probabilities are based.) 


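Product rule, obtained from the definition of conditional probability:

$$ P(x, y \mid H) = P(x \mid y, H) P(y \mid H) = P(y \mid x, H) P(x \mid H). \tag{2.6} $$

This rule is also known as the chain rule.

Sum rule, a rewriting of the marginal probability definition:

$$ P(x \mid H) = \sum_y P(x, y \mid H) \tag{2.7} $$

$$ \phantom{P(x \mid H)} = \sum_y P(x \mid y, H) P(y \mid H). \tag{2.8} $$

Bayes’ theorem, obtained from the product rule:

$$ P(y \mid x, H) = \frac{P(x \mid y, H) P(y \mid H)}{P(x \mid H)} \tag{2.9} $$

$$ \phantom{P(y \mid x, H)} = \frac{P(x \mid y, H) P(y \mid H)}{\sum_{y'} P(x \mid y', H) P(y' \mid H)}. \tag{2.10} $$

Independence. Two random variables X and Y are independent (sometimes
written $X \perp Y$) if and only if

$$ P(x, y) = P(x) P(y). \tag{2.11} $$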


Exercise 2.2.[p.40] Are the random variables X and Y in the joint ensemble
of figure 2.2 independent?


I said that we often define an ensemble in terms of a collection of condi- 
tional probabilities. The following example illustrates this idea. 


Example 2.3. Jo has a test for a nasty disease. We denote Jo’s state of health 
by the variable a and the test result by b. 


a=1 Jo has the disease 


a=0 Jo does not have the disease. (2.12) 


The result of the test is either ‘positive’ (b = 1) or ‘negative’ (b = 0); 
the test is 95% reliable: in 95% of cases of people who really have the 
disease, a positive result is returned, and in 95% of cases of people who 
do not have the disease, a negative result is obtained. The final piece of 
background information is that 1% of people of Jo’s age and background 
have the disease. 


OK — Jo has the test, and the result is positive. What is the probability 
that Jo has the disease? 


Solution. We write down all the provided probabilities. The test reliability 
specifies the conditional probability of b given a: 


\[ P(b=1 \mid a=1) = 0.95 \qquad P(b=1 \mid a=0) = 0.05 \]
\[ P(b=0 \mid a=1) = 0.05 \qquad P(b=0 \mid a=0) = 0.95; \tag{2.13} \]
and the disease prevalence tells us about the marginal probability of a:
\[ P(a=1) = 0.01 \qquad P(a=0) = 0.99. \tag{2.14} \]


From the marginal P(a) and the conditional probability P(b| a) we can deduce 
the joint probability P(a,b) = P(a)P(b|a) and any other probabilities we are 
interested in. For example, by the sum rule, the marginal probability of b=1 
— the probability of getting a positive result — is 








P(b=1) = P(b=1|a=1)P(a=1) + P(b=1|a=0)P(a=0). (2.15) 


Jo has received a positive result b=1 and is interested in how plausible it is 
that she has the disease (i.e., that a=1). The man in the street might be 
duped by the statement ‘the test is 95% reliable, so Jo’s positive result implies 
that there is a 95% chance that Jo has the disease’, but this is incorrect. The 
correct solution to an inference problem is found using Bayes’ theorem. 


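\[ P(a=1 \mid b=1) = \frac{P(b=1 \mid a=1)\,P(a=1)}{P(b=1 \mid a=1)\,P(a=1) + P(b=1 \mid a=0)\,P(a=0)} \tag{2.16} \]
\[ \phantom{P(a=1 \mid b=1)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \tag{2.17} \]
\[ \phantom{P(a=1 \mid b=1)} = 0.16. \tag{2.18} \]

So in spite of the positive result, the probability that Jo has the disease is only
16%. □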
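
The arithmetic of example 2.3 can be checked in a few lines. The following is a minimal sketch of equations (2.15) to (2.18); the variable names are illustrative choices and do not come from the text.

```python
# Posterior probability that Jo has the disease given a positive test.
# All variable names are illustrative assumptions.

p_a1 = 0.01            # P(a=1): prevalence of the disease
p_b1_given_a1 = 0.95   # P(b=1 | a=1): positive result when diseased
p_b1_given_a0 = 0.05   # P(b=1 | a=0): positive result when healthy

# Sum rule (2.15): marginal probability of a positive result.
p_b1 = p_b1_given_a1 * p_a1 + p_b1_given_a0 * (1 - p_a1)

# Bayes' theorem (2.16): posterior probability of disease.
p_a1_given_b1 = p_b1_given_a1 * p_a1 / p_b1

print(f"P(b=1)       = {p_b1:.3f}")
print(f"P(a=1 | b=1) = {p_a1_given_b1:.2f}")
```

Changing the prevalence p_a1 shows how strongly the answer depends on the prior: the same test applied to a population in which the disease is common gives a much larger posterior.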





> 2.2 The meaning of probability


Probabilities can be used in two ways. 

Probabilities can describe frequencies of outcomes in random experiments,
but giving noncircular definitions of the terms ‘frequency’ and ‘random’ is a
challenge: what does it mean to say that the frequency of a tossed coin’s
coming up heads is 1/2? If we say that this frequency is the average fraction of
heads in long sequences, we have to define ‘average’; and it is hard to define
‘average’ without using a word synonymous to probability! I will not attempt
to cut this philosophical knot.

Box 2.4. The Cox axioms. If a set of beliefs satisfy these axioms then they can be
mapped onto probabilities satisfying P(FALSE) = 0, P(TRUE) = 1, 0 ≤ P(x) ≤ 1,
and the rules of probability: P(x) = 1 − P(x̄) and P(x, y) = P(x | y)P(y).

Notation. Let ‘the degree of belief in proposition x’ be denoted by B(x). The
negation of x (NOT-x) is written x̄. The degree of belief in a conditional
proposition, ‘x, assuming proposition y to be true’, is represented
by B(x | y).

Axiom 1. Degrees of belief can be ordered; if B(x) is ‘greater’ than B(y), and
B(y) is ‘greater’ than B(z), then B(x) is ‘greater’ than B(z).
[Consequence: beliefs can be mapped onto real numbers.]

Axiom 2. The degree of belief in a proposition x and its negation x̄ are related.
There is a function f such that

\[ B(x) = f[B(\bar{x})]. \]

Axiom 3. The degree of belief in a conjunction of propositions x, y (x AND y) is
related to the degree of belief in the conditional proposition x | y and the
degree of belief in the proposition y. There is a function g such that

\[ B(x, y) = g[B(x \mid y), B(y)]. \]

Probabilities can also be used, more generally, to describe degrees of be- 
lief in propositions that do not involve random variables — for example ‘the 
probability that Mr. S. was the murderer of Mrs. S., given the evidence’ (he 
either was or wasn’t, and it’s the jury’s job to assess how probable it is that he 
was); ‘the probability that Thomas Jefferson had a child by one of his slaves’; 
‘the probability that Shakespeare’s plays were written by Francis Bacon’; or, 
to pick a modern-day example, ‘the probability that a particular signature on 
a particular cheque is genuine’. 

The man in the street is happy to use probabilities in both these ways, but 
some books on probability restrict probabilities to refer only to frequencies of 
outcomes in repeatable random experiments. 

Nevertheless, degrees of belief can be mapped onto probabilities if they sat- 
isfy simple consistency rules known as the Cox axioms (Cox, 1946) (figure 2.4). 
Thus probabilities can be used to describe assumptions, and to describe in- 
ferences given those assumptions. The rules of probability ensure that if two 
people make the same assumptions and receive the same data then they will 
draw identical conclusions. This more general use of probability to quantify 
beliefs is known as the Bayesian viewpoint. It is also known as the subjective 
interpretation of probability, since the probabilities depend on assumptions. 
Advocates of a Bayesian approach to data modelling and pattern recognition 
do not view this subjectivity as a defect, since in their view, 


you cannot do inference without making assumptions. 


In this book it will from time to time be taken for granted that a Bayesian 
approach makes sense, but the reader is warned that this is not yet a globally 
held view — the field of statistics was dominated for most of the 20th century 
by non-Bayesian methods in which probabilities are allowed to describe only 
random variables. The big difference between the two approaches is that 
Bayesians also use probabilities to describe inferences. 


> 2.3 Forward probabilities and inverse probabilities 


Probability calculations often fall into one of two categories: forward prob- 
ability and inverse probability. Here is an example of a forward probability 
problem: 


Exercise 2.4.1 P4] An urn contains K balls, of which B are black and W = 
K − B are white. Fred draws a ball at random from the urn and replaces
it, N times.

(a) What is the probability distribution of the number of times a black
ball is drawn, n_B?

(b) What is the expectation of n_B? What is the variance of n_B? What
is the standard deviation of n_B? Give numerical answers for the
cases N = 5 and N = 400, when B = 2 and K = 10.
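
Readers who want to check their numerical answers to exercise 2.4 may find the following sketch useful. It assumes that the distribution of n_B is the binomial form quoted later in equation (2.23), and computes the mean and standard deviation by direct summation rather than from a closed-form expression. The function names are hypothetical.

```python
from math import comb, sqrt

def binomial_pmf(n_black, n_draws, f):
    """P(n_B | f, N): N draws with replacement, each black with probability f."""
    return comb(n_draws, n_black) * f**n_black * (1 - f)**(n_draws - n_black)

def mean_and_std(n_draws, f):
    """Mean and standard deviation of n_B, summed directly over the distribution."""
    pmf = [binomial_pmf(n, n_draws, f) for n in range(n_draws + 1)]
    mean = sum(n * p for n, p in enumerate(pmf))
    var = sum((n - mean) ** 2 * p for n, p in enumerate(pmf))
    return mean, sqrt(var)

f_B = 2 / 10  # B = 2 black balls out of K = 10
for N in (5, 400):
    mean, std = mean_and_std(N, f_B)
    print(f"N = {N}: mean = {mean:.3f}, standard deviation = {std:.3f}")
```

Summing over the distribution keeps the sketch independent of the formulae that the exercise asks you to derive, so it can serve as a check on them.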


Forward probability problems involve a generative model that describes a pro- 
cess that is assumed to give rise to some data; the task is to compute the 
probability distribution or expectation of some quantity that depends on the 
data. Here is another example of a forward probability problem: 


“Exercise 2.5.1% P40] An urn contains K balls, of which B are black and W = 

K − B are white. We define the fraction f_B = B/K. Fred draws N
times from the urn, exactly as in exercise 2.4, obtaining n_B blacks, and
computes the quantity

\[ z = \frac{(n_B - f_B N)^2}{N f_B (1 - f_B)}. \tag{2.19} \]

What is the expectation of z? In the case N = 5 and f_B = 1/5, what
is the probability distribution of z? What is the probability that z < 1?
[Hint: compare z with the quantities computed in the previous exercise.]


Like forward probability problems, inverse probability problems involve a 
generative model of a process, but instead of computing the probability distri- 
bution of some quantity produced by the process, we compute the conditional 
probability of one or more of the unobserved variables in the process, given 
the observed variables. This invariably requires the use of Bayes’ theorem. 


Example 2.6. There are eleven urns labelled by u ∈ {0, 1, 2, ..., 10}, each con-
taining ten balls. Urn u contains u black balls and 10 − u white balls.
Fred selects an urn u at random and draws N times with replacement
from that urn, obtaining n_B blacks and N − n_B whites. Fred’s friend,
Bill, looks on. If after N = 10 draws n_B = 3 blacks have been drawn,
what is the probability that the urn Fred is using is urn u, from Bill’s
point of view? (Bill doesn’t know the value of u.)


Solution. The joint probability distribution of the random variables u and n_B
can be written
\[ P(u, n_B \mid N) = P(n_B \mid u, N)\,P(u). \tag{2.20} \]
From the joint probability of u and n_B, we can obtain the conditional
distribution of u given n_B:

\[ P(u \mid n_B, N) = \frac{P(u, n_B \mid N)}{P(n_B \mid N)} \tag{2.21} \]
\[ \phantom{P(u \mid n_B, N)} = \frac{P(n_B \mid u, N)\,P(u)}{P(n_B \mid N)}. \tag{2.22} \]
Figure 2.5. Joint probability of u and n_B for Bill and Fred’s urn
problem, after N = 10 draws. [Axes: u (vertical, 0 to 10) and n_B (horizontal, 0 to 10).]



The marginal probability of u is P(u) = 1/11 for all u. You wrote down the
probability of n_B given u and N, P(n_B | u, N), when you solved exercise 2.4
(p.27). [You are doing the highly recommended exercises, aren’t you?] If we
define f_u = u/10 then

\[ P(n_B \mid u, N) = \binom{N}{n_B} f_u^{n_B} (1 - f_u)^{N - n_B}. \tag{2.23} \]

What about the denominator, P(n_B | N)? This is the marginal probability of
n_B, which we can obtain using the sum rule:

\[ P(n_B \mid N) = \sum_u P(u, n_B \mid N) = \sum_u P(u)\,P(n_B \mid u, N). \tag{2.24} \]

So the conditional probability of u given n_B is

\[ P(u \mid n_B, N) = \frac{P(u)\,P(n_B \mid u, N)}{P(n_B \mid N)} \tag{2.25} \]
\[ \phantom{P(u \mid n_B, N)} = \frac{1}{P(n_B \mid N)} \frac{1}{11} \binom{N}{n_B} f_u^{n_B} (1 - f_u)^{N - n_B}. \tag{2.26} \]

This conditional distribution can be found by normalizing column 3 of
figure 2.5 and is shown in figure 2.6. The normalizing constant, the marginal
probability of n_B, is P(n_B=3 | N=10) = 0.083. The posterior probability
(2.26) is correct for all u, including the end-points u=0 and u=10, where
f_u = 0 and f_u = 1 respectively. The posterior probability that u=0 given
n_B=3 is equal to zero, because if Fred were drawing from urn 0 it would be
impossible for any black balls to be drawn. The posterior probability that
u=10 is also zero, because there are no white balls in that urn. The other
hypotheses u=1, u=2, ... u=9 all have non-zero posterior probability.

Figure 2.6. Conditional probability of u given n_B = 3 and N = 10.

  u    P(u | n_B=3, N)
  0    0
  1    0.063
  2    0.22
  3    0.29
  4    0.24
  5    0.13
  6    0.047
  7    0.0099
  8    0.00086
  9    0.0000096
  10   0








Terminology of inverse probability 


In inverse probability problems it is convenient to give names to the proba-
bilities appearing in Bayes’ theorem. In equation (2.25), we call the marginal
probability P(u) the prior probability of u, and P(n_B | u, N) is called the like-
lihood of u. It is important to note that the terms likelihood and probability
are not synonyms. The quantity P(n_B | u, N) is a function of both n_B and
u. For fixed u, P(n_B | u, N) defines a probability over n_B. For fixed n_B,
P(n_B | u, N) defines the likelihood of u.





Never say ‘the likelihood of the data’. Always say ‘the likelihood
of the parameters’. The likelihood function is not a probability
distribution.





(If you want to mention the data that a likelihood function is associated with, 
you may say ‘the likelihood of the parameters given the data’.) 

The conditional probability P(u | n_B, N) is called the posterior probability
of u given n_B. The normalizing constant P(n_B | N) has no u-dependence so its
value is not important if we simply wish to evaluate the relative probabilities
of the alternative hypotheses u. However, in most data-modelling problems
of any complexity, this quantity becomes important, and it is given various
names: P(n_B | N) is known as the evidence or the marginal likelihood.

If θ denotes the unknown parameters, D denotes the data, and H denotes
the overall hypothesis space, the general equation:

\[ P(\theta \mid D, H) = \frac{P(D \mid \theta, H)\,P(\theta \mid H)}{P(D \mid H)} \tag{2.27} \]

is written:

\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}. \tag{2.28} \]


Inverse probability and prediction 


Example 2.6 (continued). Assuming again that Bill has observed n_B = 3 blacks
in N = 10 draws, let Fred draw another ball from the same urn. What
is the probability that the next drawn ball is a black? [You should make
use of the posterior probabilities in figure 2.6.]

Solution. By the sum rule,

\[ P(\text{ball}_{N+1} \text{ is black} \mid n_B, N) = \sum_u P(\text{ball}_{N+1} \text{ is black} \mid u, n_B, N)\,P(u \mid n_B, N). \tag{2.29} \]

Since the balls are drawn with replacement from the chosen urn, the proba-
bility P(ball_{N+1} is black | u, n_B, N) is just f_u = u/10, whatever n_B and N are.
So

\[ P(\text{ball}_{N+1} \text{ is black} \mid n_B, N) = \sum_u f_u\,P(u \mid n_B, N). \tag{2.30} \]

Using the values of P(u | n_B, N) given in figure 2.6 we obtain

\[ P(\text{ball}_{N+1} \text{ is black} \mid n_B = 3, N = 10) = 0.333. \tag{2.31} \]
□


Comment. Notice the difference between this prediction obtained using prob- 
ability theory, and the widespread practice in statistics of making predictions 
by first selecting the most plausible hypothesis (which here would be that the 
urn is urn u = 3) and then making the predictions assuming that hypothesis 
to be true (which would give a probability of 0.3 that the next ball is black). 
The correct prediction is the one that takes into account the uncertainty by 
marginalizing over the possible values of the hypothesis u. Marginalization 
here leads to slightly more moderate, less extreme predictions. 
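
The whole of example 2.6, including the comparison made in the comment above, can be reproduced numerically. The sketch below follows equations (2.23) to (2.30); the variable names are illustrative, and the ‘plug-in’ prediction is simply the f_u of the most probable urn.

```python
from math import comb

N, n_B = 10, 3
urns = range(11)        # u = 0, 1, ..., 10
prior = [1 / 11] * 11   # P(u): uniform over the eleven urns

def likelihood(u):
    """P(n_B | u, N), equation (2.23), with f_u = u/10."""
    f = u / 10
    return comb(N, n_B) * f**n_B * (1 - f)**(N - n_B)

joint = [prior[u] * likelihood(u) for u in urns]
evidence = sum(joint)                        # P(n_B | N), equation (2.24)
posterior = [j / evidence for j in joint]    # P(u | n_B, N), equation (2.26)

# Prediction obtained by marginalizing over u, equation (2.30).
p_next_black = sum((u / 10) * posterior[u] for u in urns)

# Prediction from the single most probable urn, for comparison.
u_best = max(urns, key=lambda u: posterior[u])
p_plug_in = u_best / 10

print(f"evidence P(n_B=3 | N=10)  = {evidence:.3f}")
print(f"marginalized prediction   = {p_next_black:.3f}")
print(f"plug-in prediction (u={u_best}) = {p_plug_in:.1f}")
```

The posterior list printed to a couple of significant figures should agree with the table of figure 2.6.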
Inference as inverse probability


Now consider the following exercise, which has the character of a simple sci- 
entific investigation. 


Example 2.7. Bill tosses a bent coin N times, obtaining a sequence of heads
and tails. We assume that the coin has a probability f_H of coming up
heads; we do not know f_H. If n_H heads have occurred in N tosses, what
is the probability distribution of f_H? (For example, N might be 10, and
n_H might be 3; or, after a lot more tossing, we might have N = 300 and
n_H = 29.) What is the probability that the N+1th outcome will be a
head, given n_H heads in N tosses?


Unlike example 2.6 (p.27), this problem has a subjective element. Given a
restricted definition of probability that says ‘probabilities are the frequencies
of random variables’, this example is different from the eleven-urns example.
Whereas the urn u was a random variable, the bias f_H of the coin would not
normally be called a random variable. It is just a fixed but unknown parameter
that we are interested in. Yet don’t the two examples 2.6 and 2.7 seem to have
an essential similarity? [Especially when N = 10 and n_H = 3!]

To solve example 2.7, we have to make an assumption about what the bias
of the coin f_H might be. This prior probability distribution over f_H, P(f_H),
corresponds to the prior over u in the eleven-urns problem. [Margin note: Here
P(f) denotes a probability density, rather than a probability distribution.]
In that example, the helpful problem definition specified P(u). In real life,
we have to make assumptions in order to assign priors; these assumptions will
be subjective, and our answers will depend on them. Exactly the same can be
said for the other probabilities in our generative model too. We are assuming,
for example, that the balls are drawn from an urn independently; but could
there not be correlations in the sequence because Fred’s ball-drawing action is
not perfectly random? Indeed there could be, so the likelihood function that we
use depends on assumptions too. In real data modelling problems, priors are
subjective and so are likelihoods.


We are now using P() to denote probability densities over continuous vari-
ables as well as probabilities over discrete variables and probabilities of logical
propositions. The probability that a continuous variable v lies between values
a and b (where b > a) is defined to be ∫_a^b dv P(v). P(v)dv is dimensionless.
The density P(v) is a dimensional quantity, having dimensions inverse to the
dimensions of v, in contrast to discrete probabilities, which are dimensionless.
Don’t be surprised to see probability densities greater than 1. This is normal,
and nothing is wrong, as long as ∫_a^b dv P(v) ≤ 1 for any interval (a, b).
Conditional and joint probability densities are defined in just the same way as
conditional and joint probabilities.


> Exercise 2.8.1] Assuming a uniform prior on f_H, P(f_H) = 1, solve the problem
posed in example 2.7 (p.30). Sketch the posterior distribution of f_H and
compute the probability that the N+1th outcome will be a head, for

(a) N = 3 and n_H = 0;
(b) N = 3 and n_H = 2;
(c) N = 10 and n_H = 3;
(d) N = 300 and n_H = 29.

You will find the beta integral useful:

\[ \int_0^1 dp_a\, p_a^{F_a} (1 - p_a)^{F_b} = \frac{\Gamma(F_a + 1)\,\Gamma(F_b + 1)}{\Gamma(F_a + F_b + 2)} = \frac{F_a!\,F_b!}{(F_a + F_b + 1)!}. \tag{2.32} \]


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.




You may also find it instructive to look back at example 2.6 (p.27) and 
equation (2.31). 
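
The beta integral (2.32) can be checked numerically, and it also gives a route to the predictive probability asked for in exercise 2.8. The sketch below is one way to do this, not a worked solution from the text: it assumes the posterior is proportional to f_H^{n_H} (1 − f_H)^{N − n_H} under the uniform prior, so that the predictive probability is a ratio of two beta integrals. Function names are hypothetical. Readers who prefer to attempt the exercise unaided may wish to return to this later.

```python
from fractions import Fraction
from math import factorial

def beta_integral(F_a, F_b):
    """Right-hand side of (2.32) for non-negative integers F_a, F_b."""
    return Fraction(factorial(F_a) * factorial(F_b), factorial(F_a + F_b + 1))

def midpoint_integral(F_a, F_b, steps=10_000):
    """Midpoint-rule estimate of the left-hand side of (2.32)."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** F_a * (1 - (i + 0.5) * h) ** F_b
                   for i in range(steps))

def predictive_head(n_H, N):
    """P(head on toss N+1 | n_H heads in N tosses), uniform prior assumed."""
    return beta_integral(n_H + 1, N - n_H) / beta_integral(n_H, N - n_H)

print(float(beta_integral(3, 7)), midpoint_integral(3, 7))
for n_H, N in [(0, 3), (2, 3), (3, 10), (29, 300)]:
    print(f"N = {N}, n_H = {n_H}: {predictive_head(n_H, N)}")
```

Using exact fractions for the factorial form avoids overflow and rounding when N is as large as 300.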


People sometimes confuse assigning a prior distribution to an unknown pa-
rameter such as f_H with making an initial guess of the value of the parameter.
But the prior over f_H, P(f_H), is not a simple statement like ‘initially, I would
guess f_H = 1/2’. The prior is a probability density over f_H which specifies the
prior degree of belief that f_H lies in any interval (f, f + δf). It may well be
the case that our prior for f_H is symmetric about 1/2, so that the mean of f_H
under the prior is 1/2. In this case, the predictive distribution for the first toss
x_1 would indeed be

\[ P(x_1 = \text{head}) = \int df_H\, P(f_H)\, P(x_1 = \text{head} \mid f_H) = \int df_H\, P(f_H)\, f_H = 1/2. \tag{2.33} \]

But the prediction for subsequent tosses will depend on the whole prior dis-
tribution, not just its mean.


Data compression and inverse probability 


Consider the following task. 


Example 2.9. Write a computer program capable of compressing binary files 
like this one: 
0000000000000000000010010001000000100000010000000000000000000000000000000000001010000000000000110000 


1000000000010000100000000010000000000000000000000100000000000000000100000000011000001000000011000100 
0000000001001000000000010001000000000000000011000000000000000000000000000010000000000000000100000000 


The string shown contains n_1 = 29 1s and n_0 = 271 0s.


Intuitively, compression works by taking advantage of the predictability of a
file. In this case, the source of the file appears more likely to emit 0s than
1s. A data compression program that compresses this file must, implicitly or
explicitly, be addressing the question ‘What is the probability that the next
character in this file is a 1?’

Do you think this problem is similar in character to example 2.7 (p.30)? 
I do. One of the themes of this book is that data compression and data 
modelling are one and the same, and that they should both be addressed, like 
the urn of example 2.6, using inverse probability. Example 2.9 is solved in 
Chapter 6. 


The likelihood principle 


Please solve the following two exercises.

Example 2.10. Urn A contains three balls: one black, and two white; urn B
contains three balls: two black, and one white. One of the urns is
selected at random and one ball is drawn. The ball is black. What is
the probability that the selected urn is urn A?

Figure 2.7. Urns for example 2.10.

Example 2.11. Urn A contains five balls: one black, two white, one green and
one pink; urn B contains five hundred balls: two hundred black, one
hundred white, 50 yellow, 40 cyan, 30 sienna, 25 green, 25 silver, 20
gold, and 10 purple. [One fifth of A’s balls are black; two-fifths of B’s
are black.] One of the urns is selected at random and one ball is drawn.
The ball is black. What is the probability that the urn is urn A?

Figure 2.8. Urns for example 2.11.
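
Both examples can be worked with the same few lines of code. In this minimal sketch the function name and its arguments are illustrative; the only inputs are the probability of drawing a black ball from each urn and the prior probability of urn A, taken to be 1/2 because the urn is selected at random.

```python
from fractions import Fraction

def posterior_urn_A(p_black_A, p_black_B, prior_A=Fraction(1, 2)):
    """P(urn A | black ball) by Bayes' theorem, given the two likelihoods."""
    joint_A = prior_A * p_black_A
    joint_B = (1 - prior_A) * p_black_B
    return joint_A / (joint_A + joint_B)

ex_2_10 = posterior_urn_A(Fraction(1, 3), Fraction(2, 3))
ex_2_11 = posterior_urn_A(Fraction(1, 5), Fraction(200, 500))
print(ex_2_10, ex_2_11)
```

Note that the function never receives the counts of white, green or sienna balls: nothing about the colours that were not drawn is needed to evaluate it.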
What do you notice about your solutions? Does each answer depend on the
detailed contents of each urn?

The details of the other possible outcomes and their probabilities are ir-
relevant. All that matters is the probability of the outcome that actually
happened (here, that the ball drawn was black) given the different hypothe-
ses. We need only to know the likelihood, i.e., how the probability of the data
that happened varies with the hypothesis. This simple rule about inference is
known as the likelihood principle.

The likelihood principle: given a generative model for data d given
parameters θ, P(d | θ), and having observed a particular outcome
d_1, all inferences and predictions should depend only on the function
P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods
violate it.

> 2.4 Definition of entropy and related functions

The Shannon information content of an outcome x is defined to be

\[ h(x) = \log_2 \frac{1}{P(x)}. \tag{2.34} \]

It is measured in bits. [The word ‘bit’ is also used to denote a variable
whose value is 0 or 1; I hope context will always make clear which of the
two meanings is intended.]

In the next few chapters, we will establish that the Shannon information
content h(a_i) is indeed a natural measure of the information content
of the event x = a_i. At that point, we will shorten the name of this
quantity to ‘the information content’.

Table 2.9. Shannon information contents of the outcomes a-z.

  i    a_i   p_i     h(p_i)
  1    a     .0575   4.1
  2    b     .0128   6.3
  3    c     .0263   5.2
  4    d     .0285   5.1
  5    e     .0913   3.5
  6    f     .0173   5.9
  7    g     ...     ...
  8    h     .0313   5.0
  9    i     .0599   4.1
  10   j     .0006   10.7
  11   k     .0084   6.9
  12   l     .0335   4.9
  13   m     .0235   5.4
  14   n     .0596   4.1
  15   o     .0689   3.9
  16   p     .0192   5.7
  17   q     .0008   10.3
  18   r     .0508   4.3
  19   s     .0567   4.1
  20   t     .0706   3.8
  21   u     ...     ...
  22   v     .0069   7.2
  23   w     .0119   6.4
  24   x     .0073   7.1
  25   y     .0164   5.9
  26   z     .0007   10.4
  27   -     .1928   2.4

  Σ_i p_i log_2 (1/p_i) = 4.1


The fourth column in table 2.9 shows the Shannon information content 
of the 27 possible outcomes when a random character is picked from 
an English document. The outcome x = z has a Shannon information 
content of 10.4 bits, and x = e has an information content of 3.5 bits. 


The entropy of an ensemble X is defined to be the average Shannon in-
formation content of an outcome:

\[ H(X) \equiv \sum_{x \in A_X} P(x) \log \frac{1}{P(x)}, \tag{2.35} \]

with the convention for P(x) = 0 that 0 × log 1/0 = 0, since
lim_{θ→0+} θ log 1/θ = 0.


Like the information content, entropy is measured in bits. 


When it is convenient, we may also write H(X) as H(p), where p is
the vector (p_1, p_2, ..., p_I). Another name for the entropy of X is the
uncertainty of X.
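
The two definitions (2.34) and (2.35) translate directly into code. The following is a minimal sketch with hypothetical function names; it works in bits (base-2 logarithms) and implements the convention 0 × log 1/0 = 0 by skipping zero-probability outcomes.

```python
from math import log2

def information_content(p):
    """Shannon information content h = log2(1/p), in bits, for probability p > 0."""
    return log2(1 / p)

def entropy(probs):
    """H(X) = sum_x P(x) log2(1/P(x)), in bits; terms with P(x) = 0 contribute 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(information_content(0.0913))   # the letter e of table 2.9
print(entropy([0.5, 0.5]))           # a fair coin
print(entropy([1.0, 0.0]))           # a certain outcome
print(entropy([1 / 27] * 27))        # 27 equiprobable characters
```

The last line gives the largest entropy that any distribution over 27 characters can have, which is a useful yardstick for the English-text figure quoted in the next example.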


Example 2.12. The entropy of a randomly selected letter in an English docu-
ment is about 4.11 bits, assuming its probability is as given in table 2.9.
We obtain this number by averaging log 1/p_i (shown in the fourth col-
umn) under the probability distribution p_i (shown in the third column).


We now note some properties of the entropy function.

• H(X) ≥ 0 with equality iff p_i = 1 for one i. [iff means ‘if and only if’.]
• Entropy is maximized if p is uniform:

\[ H(X) \le \log(|A_X|) \text{ with equality iff } p_i = 1/|A_X| \text{ for all } i. \tag{2.36} \]

Notation: the vertical bars ‘| · |’ have two meanings. If A_X is a set, |A_X|
denotes the number of elements in A_X; if x is a number, then |x| is the
absolute value of x.


The redundancy measures the fractional difference between H(X) and its max-
imum possible value, log(|A_X|).

The redundancy of X is:

\[ 1 - \frac{H(X)}{\log |A_X|}. \tag{2.37} \]

We won’t make use of ‘redundancy’ in this book, so I have not assigned
a symbol to it.


The joint entropy of X, Y is:

\[ H(X, Y) = \sum_{xy \in A_X A_Y} P(x, y) \log \frac{1}{P(x, y)}. \tag{2.38} \]

Entropy is additive for independent random variables:

\[ H(X, Y) = H(X) + H(Y) \text{ iff } P(x, y) = P(x)P(y). \tag{2.39} \]


Our definitions for information content so far apply only to discrete probability
distributions over finite sets A_X. The definitions can be extended to infinite
sets, though the entropy may then be infinite. The case of a probability
density over a continuous set is addressed in section 11.3. Further important
definitions and exercises to do with entropy will come along in section 8.1.


> 2.5 Decomposability of the entropy 


The entropy function satisfies a recursive property that can be very useful 
when computing entropies. For convenience, we’ll stretch our notation so that 
we can write H(X) as H(p), where p is the probability vector associated with 
the ensemble X. 

Let’s illustrate the property by an example first. Imagine that a random
variable x ∈ {0, 1, 2} is created by first flipping a fair coin to determine whether
x = 0; then, if x is not 0, flipping a fair coin a second time to determine whether
x is 1 or 2. The probability distribution of x is

\[ P(x=0) = \frac{1}{2}; \quad P(x=1) = \frac{1}{4}; \quad P(x=2) = \frac{1}{4}. \tag{2.40} \]

What is the entropy of X? We can either compute it by brute force:

\[ H(X) = \tfrac{1}{2} \log 2 + \tfrac{1}{4} \log 4 + \tfrac{1}{4} \log 4 = 1.5; \tag{2.41} \]

or we can use the following decomposition, in which the value of x is revealed
gradually. Imagine first learning whether x=0, and then, if x is not 0, learning
which non-zero value is the case. The revelation of whether x=0 or not entails
revealing a binary variable whose probability distribution is {1/2, 1/2}. This
revelation has an entropy H(1/2, 1/2) = 1/2 log 2 + 1/2 log 2 = 1 bit. If x is not 0,
we learn the value of the second coin flip. This too is a binary variable whose
probability distribution is {1/2, 1/2}, and whose entropy is 1 bit. We only get
to experience the second revelation half the time, however, so the entropy can
be written:

\[ H(X) = H(1/2, 1/2) + \tfrac{1}{2} H(1/2, 1/2). \tag{2.42} \]


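Generalizing, the observation we are making about the entropy of any
probability distribution p = {p_1, p_2, ..., p_I} is that

\[ H(\mathbf{p}) = H(p_1, 1 - p_1) + (1 - p_1)\, H\!\left( \frac{p_2}{1 - p_1}, \frac{p_3}{1 - p_1}, \ldots, \frac{p_I}{1 - p_1} \right). \tag{2.43} \]

When it’s written as a formula, this property looks regrettably ugly; nev-
ertheless it is a simple property and one that you should make use of.
Generalizing further, the entropy has the property for any m that

\[ H(\mathbf{p}) = H\left[ (p_1 + p_2 + \cdots + p_m),\, (p_{m+1} + p_{m+2} + \cdots + p_I) \right] \]
\[ \quad + (p_1 + \cdots + p_m)\, H\!\left( \frac{p_1}{(p_1 + \cdots + p_m)}, \ldots, \frac{p_m}{(p_1 + \cdots + p_m)} \right) \]
\[ \quad + (p_{m+1} + \cdots + p_I)\, H\!\left( \frac{p_{m+1}}{(p_{m+1} + \cdots + p_I)}, \ldots, \frac{p_I}{(p_{m+1} + \cdots + p_I)} \right). \tag{2.44} \]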
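
Because (2.44) is easier to trust once it has been seen to work, here is a small numerical check. It is a sketch with hypothetical function names, and it assumes that both groups of probabilities have a non-zero total so that the divisions are defined.

```python
from math import log2, isclose

def entropy(probs):
    """H(p) in bits, with the convention that zero-probability terms contribute 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

def entropy_by_decomposition(probs, m):
    """Right-hand side of (2.44): split p into its first m and its remaining entries."""
    head, tail = probs[:m], probs[m:]
    s_head, s_tail = sum(head), sum(tail)   # assumed non-zero
    return (entropy([s_head, s_tail])
            + s_head * entropy([p / s_head for p in head])
            + s_tail * entropy([p / s_tail for p in tail]))

p = [1 / 2, 1 / 4, 1 / 4]   # the distribution of equation (2.40)
print(entropy(p), entropy_by_decomposition(p, 1))

q = [0.1, 0.2, 0.3, 0.4]    # an arbitrary distribution, split after m = 2
print(isclose(entropy(q), entropy_by_decomposition(q, 2)))
```

With m = 1 the function reduces to (2.43), since the entropy of the single-element first group is zero.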


Example 2.13. A source produces a character x from the alphabet A =
{0, 1, ..., 9, a, b, ..., z}; with probability 1/3, x is a numeral (0, ..., 9);
with probability 1/3, x is a vowel (a, e, i, o, u); and with probability 1/3
it’s one of the 21 consonants. All numerals are equiprobable, and the
same goes for vowels and consonants. Estimate the entropy of X.

Solution. log 3 + (1/3)(log 10 + log 5 + log 21) = log 3 + (1/3) log 1050 ≃ log 30 bits. □
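
As a cross-check on this solution, the entropy can also be computed by brute force from the full 36-outcome probability vector. This is a minimal sketch; the variable names are illustrative.

```python
from math import log2, isclose

# 10 numerals, 5 vowels and 21 consonants, each group carrying total probability 1/3.
probs = [1 / 30] * 10 + [1 / 15] * 5 + [1 / 63] * 21

H_direct = sum(p * log2(1 / p) for p in probs)
H_decomposed = log2(3) + (log2(10) + log2(5) + log2(21)) / 3

print(f"direct sum     = {H_direct:.3f} bits")
print(f"decomposition  = {H_decomposed:.3f} bits")
print(f"log2(30)       = {log2(30):.3f} bits")
```

The first two numbers agree, and both are close to log 30, which is the estimate given in the solution.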


> 2.6 Gibbs’ inequality

The relative entropy or Kullback-Leibler divergence between two
probability distributions P(x) and Q(x) that are defined over the same
alphabet A_X is

\[ D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}. \tag{2.45} \]

[Margin note: The ‘ei’ in Leibler is pronounced the same as in heist.]

The relative entropy satisfies Gibbs’ inequality

\[ D_{KL}(P \| Q) \ge 0 \tag{2.46} \]

with equality only if P = Q. Note that in general the relative entropy
is not symmetric under interchange of the distributions P and Q: in
general D_KL(P||Q) ≠ D_KL(Q||P), so D_KL, although it is sometimes
called the ‘KL distance’, is not strictly a distance. The relative entropy
is important in pattern recognition and neural networks, as well as in
information theory.
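
A short numerical illustration of (2.45) and (2.46) may help. In this sketch the function name and the two example distributions are arbitrary choices, base-2 logarithms are used so that the result is in bits, and it is assumed that Q(x) > 0 wherever P(x) > 0.

```python
from math import log2

def kl_divergence(P, Q):
    """D_KL(P||Q) = sum_x P(x) log2(P(x)/Q(x)); terms with P(x) = 0 contribute 0.

    Assumes Q(x) > 0 wherever P(x) > 0.
    """
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.5]
Q = [0.9, 0.1]

d_pq = kl_divergence(P, Q)
d_qp = kl_divergence(Q, P)
print(f"D_KL(P||Q) = {d_pq:.3f} bits")
print(f"D_KL(Q||P) = {d_qp:.3f} bits")
print(f"D_KL(P||P) = {kl_divergence(P, P):.3f} bits")
```

Both divergences are positive, they differ from each other, and the divergence of a distribution from itself is zero, as the text states.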


Gibbs’ inequality is probably the most important inequality in this book. It, 
and many other inequalities, can be proved using the concept of convexity. 


> 2.7 Jensen’s inequality for convex functions


The words ‘convex ⌣’ and ‘concave ⌢’ may be pronounced ‘convex-smile’ and
‘concave-frown’. This terminology has useful redundancy: while one may forget
which way up ‘convex’ and ‘concave’ are, it is harder to confuse a smile with a
frown.


Convex ⌣ functions. A function f(x) is convex ⌣ over (a, b) if every chord
of the function lies above the function, as shown in figure 2.10; that is,
for all x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1,

\[ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2). \tag{2.47} \]

A function f is strictly convex ⌣ if, for all x_1, x_2 ∈ (a, b), the equality
holds only for λ = 0 and λ = 1.

Similar definitions apply to concave ⌢ and strictly concave ⌢ functions.

Some strictly convex ⌣ functions are

• x², e^x and e^{−x} for all x;

• log(1/x) and x log x for x > 0.








Jensen’s inequality. If f is a convex ⌣ function and x is a random variable
then:

\[ \mathcal{E}[f(x)] \ge f(\mathcal{E}[x]), \tag{2.48} \]

where E denotes expectation. If f is strictly convex ⌣ and E[f(x)] =
f(E[x]), then the random variable x is a constant.

Jensen’s inequality can also be rewritten for a concave ⌢ function, with
the direction of the inequality reversed.

A physical version of Jensen’s inequality runs as follows.

If a collection of masses p_i are placed on a convex ⌣ curve f(x)
at locations (x_i, f(x_i)), then the centre of gravity of those masses,
which is at (E[x], E[f(x)]), lies above the curve.


If this fails to convince you, then feel free to do the following exercise. 


Exercise 2.14.1 P-44] Prove Jensen’s inequality. 


Example 2.15. Three squares have average area Ā = 100 m². The average of
the lengths of their sides is l̄ = 10 m. What can be said about the size
of the largest of the three squares? [Use Jensen’s inequality.]

Solution. Let x be the length of the side of a square, and let the probability
of x be 1/3, 1/3, 1/3 over the three lengths l_1, l_2, l_3. Then the information that
we have is that E[x] = 10 and E[f(x)] = 100, where f(x) = x² is the function
mapping lengths to areas. This is a strictly convex ⌣ function. We notice
that the equality E[f(x)] = f(E[x]) holds, therefore x is a constant, and the
three lengths must all be equal. The area of the largest square is 100 m². □
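
The argument of example 2.15 can be made concrete by computing the gap E[f(x)] − f(E[x]) for f(x) = x². In this sketch the function names and the second set of side lengths (8, 10 and 12) are illustrative assumptions; exact fractions are used so that a gap of zero is reported as exactly zero.

```python
from fractions import Fraction

def expectation(values, probs):
    """E[x] for a discrete distribution given as parallel lists."""
    return sum(v * p for v, p in zip(values, probs))

def jensen_gap(values, probs, f):
    """E[f(x)] - f(E[x]); Jensen's inequality says this is >= 0 for convex f."""
    return expectation([f(v) for v in values], probs) - f(expectation(values, probs))

def square(x):
    return x * x

thirds = [Fraction(1, 3)] * 3
gap_equal = jensen_gap([10, 10, 10], thirds, square)    # all sides equal
gap_unequal = jensen_gap([8, 10, 12], thirds, square)   # same mean side, unequal sides
print(gap_equal, gap_unequal)
```

Three unequal squares with mean side 10 m necessarily have mean area above 100 m², so the data of the example leave room only for the equal-sides case.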





Figure 2.10. Definition of convexity. [Points marked on the horizontal axis: x_1, x_2, and x* = λx_1 + (1 − λ)x_2.]

Figure 2.11. Convex ⌣ functions.

[Margin figure label: Centre of gravity.]


Convexity and concavity also relate to maximization 


If f(x) is concave ⌢ and there exists a point at which

\[ \frac{\partial f}{\partial x_k} = 0 \text{ for all } k, \tag{2.49} \]

then f(x) has its maximum value at that point.

The converse does not hold: if a concave ⌢ f(x) is maximized at some x
it is not necessarily true that the gradient ∇f(x) is equal to zero there. For
example, f(x) = −|x| is maximized at x = 0 where its derivative is undefined;
and f(p) = log(p), for a probability p ∈ (0, 1), is maximized on the boundary
of the range, at p = 1, where the gradient df(p)/dp = 1.


> 2.8 Exercises
Sums of random variables 


Exercise 2.16.19 P4 (a) Two ordinary dice with faces labelled 1,...,6 are 
thrown. What is the probability distribution of the sum of the val-
ues? What is the probability distribution of the absolute difference
between the values?

(b) One hundred ordinary dice are thrown. What, roughly, is the prob-
ability distribution of the sum of the values? Sketch the probability
distribution and estimate its mean and standard deviation.

(c) How can two cubical dice be labelled using the numbers
{0, 1, 2, 3, 4, 5, 6} so that when the two dice are thrown the sum
has a uniform probability distribution over the integers 1-12?

(d) Is there any way that one hundred dice could be labelled with inte-
gers such that the probability distribution of the sum is uniform?

[Margin note: This exercise is intended to help you think about the central-limit
theorem, which says that if independent random variables x_1, x_2, ..., x_N have
means μ_n and finite variances σ_n², then, in the limit of large N, the sum Σ_n x_n
has a distribution that tends to a normal (Gaussian) distribution with mean
Σ_n μ_n and variance Σ_n σ_n².]


Inference problems 


> Exercise 2.17. [p.41] If q = 1 − p and a = ln p/q, show that

$$ p = \frac{1}{1 + \exp(-a)}. \qquad (2.50) $$

Sketch this function and find its relationship to the hyperbolic tangent
function $\tanh(u) = \frac{e^{u} - e^{-u}}{e^{u} + e^{-u}}$.

It will be useful to be fluent in base-2 logarithms also. If b = log_2 p/q,
what is p as a function of b?

> Exercise 2.18. [p.42] Let x and y be dependent random variables with x a
binary variable taking values in A_X = {0, 1}. Use Bayes' theorem to
show that the log posterior probability ratio for x given y is

$$ \log \frac{P(x=1 \mid y)}{P(x=0 \mid y)} = \log \frac{P(y \mid x=1)}{P(y \mid x=0)} + \log \frac{P(x=1)}{P(x=0)}. \qquad (2.51) $$

> Exercise 2.19. [2, p.42] Let x, d_1 and d_2 be random variables such that d_1 and
d_2 are conditionally independent given a binary variable x. Use Bayes'
theorem to show that the posterior probability ratio for x given {d_i} is

$$ \frac{P(x=1 \mid \{d_i\})}{P(x=0 \mid \{d_i\})} = \frac{P(d_1 \mid x=1)}{P(d_1 \mid x=0)} \, \frac{P(d_2 \mid x=1)}{P(d_2 \mid x=0)} \, \frac{P(x=1)}{P(x=0)}. \qquad (2.52) $$


Life in high-dimensional spaces 


Probability distributions and volumes have some unexpected properties in 
high-dimensional spaces. 


> Exercise 2.20. [p.42] Consider a sphere of radius r in an N-dimensional real
space. Show that the fraction of the volume of the sphere that is in the
surface shell lying at values of the radius between r − ε and r, where
0 < ε < r, is:

$$ f = 1 - \left(1 - \frac{\epsilon}{r}\right)^{N}. \qquad (2.53) $$

Evaluate f for the cases N = 2, N = 10 and N = 1000, with (a) ε/r = 0.01;
(b) ε/r = 0.5.

Implication: points that are uniformly distributed in a sphere in N
dimensions, where N is large, are very likely to be in a thin shell near the
surface.


Expectations and entropies 


You are probably familiar with the idea of computing the expectation of a 
function of x, 


$$ E[f(x)] = \langle f(x) \rangle = \sum_x P(x) f(x). \qquad (2.54) $$


Maybe you are not so comfortable with computing this expectation in cases 
where the function f(x) depends on the probability P(x). The next few ex- 
amples address this concern. 


> Exercise 2.21. [p.43] Let p_a = 0.1, p_b = 0.2, and p_c = 0.7. Let f(a) = 10,
f(b) = 5, and f(c) = 10/7. What is E[f(x)]? What is E[1/P(x)]?


Exercise 2.22. [p.43] For an arbitrary ensemble, what is E[1/P(x)]?

> Exercise 2.23. [p.43] Let p_a = 0.1, p_b = 0.2, and p_c = 0.7. Let g(a) = 0, g(b) = 1,
and g(c) = 0. What is E[g(x)]?








> Exercise 2.24. [p.43] Let p_a = 0.1, p_b = 0.2, and p_c = 0.7. What is the
probability that P(x) ∈ [0.15, 0.5]? What is

$$ P\left( \left| \log \frac{P(x)}{0.2} \right| > 0.05 \right)? $$


> Exercise 2.25. [p.43] Prove the assertion that H(X) ≤ log(|A_X|) with
equality iff p_i = 1/|A_X| for all i. (|A_X| denotes the number of elements in
the set A_X.) [Hint: use Jensen's inequality (2.48); if your first attempt
to use Jensen does not succeed, remember that Jensen involves both a
random variable and a function, and you have quite a lot of freedom in
choosing these; think about whether your chosen function f should be
convex or concave.]


> Exercise 2.26. [3, p.44] Prove that the relative entropy (equation (2.45)) satisfies
D_KL(P||Q) ≥ 0 (Gibbs' inequality) with equality only if P = Q.


> Exercise 2.27. Prove that the entropy is indeed decomposable as described
in equations (2.43-2.44).
> Exercise 2.28. [p.45] A random variable x ∈ {0, 1, 2, 3} is selected by flipping
a bent coin with bias f to determine whether the outcome is in {0, 1} or
{2, 3}; then either flipping a second bent coin with bias g or a third bent
coin with bias h respectively. Write down the probability distribution
of x. Use the decomposability of the entropy (2.44) to find the entropy
of X. [Notice how compact an expression is obtained if you make use
of the binary entropy function H_2(x), compared with writing out the
four-term entropy explicitly.] Find the derivative of H(X) with respect
to f. [Hint: dH_2(x)/dx = log((1 − x)/x).]

> Exercise 2.29. [p.45] An unbiased coin is flipped until one head is thrown.
What is the entropy of the random variable x ∈ {1, 2, 3, ...}, the number
of flips? Repeat the calculation for the case of a biased coin with
probability f of coming up heads. [Hint: solve the problem both directly
and by using the decomposability of the entropy (2.43).]


▶ 2.9 Further exercises


Forward probability 


> Exercise 2.30. An urn contains w white balls and b black balls. Two balls
are drawn, one after the other, without replacement. Prove that the
probability that the first ball is white is equal to the probability that the
second is white.


> Exercise 2.31. A circular coin of diameter a is thrown onto a square grid
whose squares are b × b. (a < b) What is the probability that the coin
will lie entirely within one square? [Ans: (1 − a/b)^2]


> Exercise 2.32. Buffon's needle. A needle of length a is thrown onto a plane
covered with equally spaced parallel lines with separation b. What is
the probability that the needle will cross a line? [Ans, if a < b: 2a/πb]
[Generalization (Buffon's noodle): on average, a random curve of length
A is expected to intersect the lines 2A/πb times.]


Exercise 2.33. [2] Two points are selected at random on a straight line segment
of length 1. What is the probability that a triangle can be constructed
out of the three resulting segments?


Exercise 2.34. [p.45] An unbiased coin is flipped until one head is thrown.
What is the expected number of tails and the expected number of heads?


Fred, who doesn't know that the coin is unbiased, estimates the bias
using f̂ ≡ h/(h + t), where h and t are the numbers of heads and tails
tossed. Compute and sketch the probability distribution of f̂.


N.B., this is a forward probability problem, a sampling theory problem, 
not an inference problem. Don’t use Bayes’ theorem. 


> Exercise 2.35. [p.45] Fred rolls an unbiased six-sided die once per second,
noting the occasions when the outcome is a six.
(a) What is the mean number of rolls from one six to the next six? 


(b) Between two rolls, the clock strikes one. What is the mean number 
of rolls until the next six? 


(c) Now think back before the clock struck. What is the mean number 
of rolls, going back in time, until the most recent six? 


(d) What is the mean number of rolls from the six before the clock 
struck to the next six? 


(e) Is your answer to (d) different from your answer to (a)? Explain. 


Another version of this exercise refers to Fred waiting for a bus at a 
bus-stop in Poissonville where buses arrive independently at random (a 
Poisson process), with, on average, one bus every six minutes. What is 
the average wait for a bus, after Fred arrives at the stop? [6 minutes.] So 
what is the time between the two buses, the one that Fred just missed, 
and the one that he catches? [12 minutes.] Explain the apparent para- 
dox. Note the contrast with the situation in Clockville, where the buses 
are spaced exactly 6 minutes apart. There, as you can confirm, the mean 
wait at a bus-stop is 3 minutes, and the time between the missed bus 
and the next one is 6 minutes. 


Conditional probability 


> Exercise 2.36. [2] You meet Fred. Fred tells you he has two brothers, Alf and
Bob.

What is the probability that Fred is older than Bob?

Fred tells you that he is older than Alf. Now, what is the probability
that Fred is older than Bob? (That is, what is the conditional probability
that F > B given that F > A?)
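
One way to check an answer to exercise 2.36 is to enumerate age orderings. The sketch below assumes that all six orderings of the three brothers are equally likely before Fred speaks; that prior, and the helper names `older` and `prob`, are assumptions of this illustration rather than part of the exercise.

```python
from fractions import Fraction
from itertools import permutations

# Each ordering lists the brothers from youngest to oldest.
# Assumption: all six orderings are equally likely a priori.
orderings = list(permutations(["Fred", "Alf", "Bob"]))

def older(order, first, second):
    """True if `first` is older than `second` in this ordering."""
    return order.index(first) > order.index(second)

def prob(event, given=lambda order: True):
    """Conditional probability of `event` among orderings satisfying `given`."""
    kept = [o for o in orderings if given(o)]
    return Fraction(sum(event(o) for o in kept), len(kept))

p_prior = prob(lambda o: older(o, "Fred", "Bob"))
p_given = prob(lambda o: older(o, "Fred", "Bob"),
               given=lambda o: older(o, "Fred", "Alf"))
print(p_prior, p_given)
```

Under that uniform prior, learning F > A removes half of the orderings, and Fred is the eldest in two of the three that remain.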


> Exercise 2.37. [2] The inhabitants of an island tell the truth one third of the
time. They lie with probability 2/3.


On an occasion, after one of them made a statement, you ask another 
‘was that statement true?’ and he says ‘yes’. 


What is the probability that the statement was indeed true? 


> Exercise 2.38. [p.46] Compare two ways of computing the probability of error
of the repetition code R_3, assuming a binary symmetric channel (you
did this once for exercise 1.2 (p.7)) and confirm that they give the same
answer.


Binomial distribution method. Add the probability that all three
bits are flipped to the probability that exactly two bits are flipped.


Sum rule method. Using the sum rule, compute the marginal probability
that r takes on each of the eight possible values, P(r).
[P(r) = Σ_s P(s)P(r|s).] Then compute the posterior probability
of s for each of the eight values of r. [In fact, by symmetry,
only two example cases r = (000) and r = (001) need be considered.]
Notice that some of the inferred bits are better determined
than others. From the posterior probability P(s|r) you can read
out the case-by-case error probability, the probability that the more
probable hypothesis is not correct, P(error|r). Find the average
error probability using the sum rule,

$$ P(\text{error}) = \sum_{r} P(r) P(\text{error} \mid r). \qquad (2.55) $$

[Margin note: Equation (1.18) gives the posterior probability of the input
s given the received vector r.]
> Exercise 2.39. [p.46] The frequency p_n of the nth most frequent word in
English is roughly approximated by

$$ p_n \simeq \begin{cases} \dfrac{0.1}{n} & \text{for } n \in 1, \ldots, 12\,367 \\ 0 & n > 12\,367. \end{cases} \qquad (2.56) $$


[This remarkable 1/n law is known as Zipf’s law, and applies to the word 
frequencies of many languages (Zipf, 1949).] If we assume that English 
is generated by picking words at random according to this distribution, 
what is the entropy of English (per word)? [This calculation can be found 
in ‘Prediction and entropy of printed English’, C.E. Shannon, Bell Syst. 
Tech. J. 30, pp.50-64 (1950), but, inexplicably, the great man made 
numerical errors in it.] 


▶ 2.10 Solutions


Solution to exercise 2.2 (p.24). No, they are not independent. If they were 
then all the conditional distributions P(y |x) would be identical functions of 
y, regardless of x (cf. figure 2.3). 


Solution to exercise 2.4 (p.27). We define the fraction f_B ≡ B/K.


(a) The number of black balls has a binomial distribution.

$$ P(n_B \mid f_B, N) = \binom{N}{n_B} f_B^{\,n_B} (1 - f_B)^{N - n_B}. \qquad (2.57) $$


(b) The mean and variance of this distribution are:

$$ E[n_B] = N f_B \qquad (2.58) $$

$$ \mathrm{var}[n_B] = N f_B (1 - f_B). \qquad (2.59) $$


These results were derived in example 1.1 (p.1). The standard deviation
of n_B is $\sqrt{\mathrm{var}[n_B]} = \sqrt{N f_B (1 - f_B)}$.

When B/K = 1/5 and N = 5, the expectation and variance of n_B are 1
and 4/5. The standard deviation is 0.89.


When B/K = 1/5 and N = 400, the expectation and variance of n_B are
80 and 64. The standard deviation is 8.


Solution to exercise 2.5 (p.27). The numerator of the quantity

$$ z = \frac{(n_B - f_B N)^2}{N f_B (1 - f_B)} $$

can be recognized as (n_B − E[n_B])^2; the denominator is equal to the variance
of n_B (2.59), which is by definition the expectation of the numerator. So the
expectation of z is 1. [A random variable like z, which measures the deviation
of data from the expected value, is sometimes called χ^2 (chi-squared).]

In the case N = 5 and f_B = 1/5, N f_B is 1, and var[n_B] is 4/5. The
numerator has five possible values, only one of which is smaller than 1:
(n_B − f_B N)^2 = 0 has probability P(n_B = 1) = 0.4096; so the probability
that z < 1 is 0.4096.
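
The numbers quoted in the solutions to exercises 2.4 and 2.5 can be reproduced by summing over the binomial distribution directly. The following is a minimal sketch; the function names `binomial_pmf` and `summarize` are illustrative choices, not notation from the text.

```python
from math import comb

def binomial_pmf(n_b, f_b, n):
    """P(n_B | f_B, N) for the binomial distribution (2.57)."""
    return comb(n, n_b) * f_b**n_b * (1 - f_b) ** (n - n_b)

def summarize(f_b, n):
    """Return mean, variance, E[z] and P(z < 1) by direct summation."""
    pmf = [binomial_pmf(k, f_b, n) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    # z = (n_B - f_B N)^2 / (N f_B (1 - f_B)), as in the solution to 2.5
    z = [(k - f_b * n) ** 2 / (n * f_b * (1 - f_b)) for k in range(n + 1)]
    expected_z = sum(zk * p for zk, p in zip(z, pmf))
    prob_z_below_1 = sum(p for zk, p in zip(z, pmf) if zk < 1)
    return mean, var, expected_z, prob_z_below_1

mean5, var5, ez5, pz5 = summarize(0.2, 5)
mean400, var400, ez400, _ = summarize(0.2, 400)
print(mean5, var5, ez5, pz5)
print(mean400, var400, ez400)
```

For N = 5 the only outcome with z < 1 is n_B = 1, so the last value printed on the first line is P(n_B = 1).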


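Solution to exercise 2.14 (p.35). We wish to prove, given the property

$$ f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2), \qquad (2.60) $$

that, if $\sum_i p_i = 1$ and $p_i \ge 0$,

$$ \sum_{i=1}^{I} p_i f(x_i) \ge f\!\left( \sum_{i=1}^{I} p_i x_i \right). \qquad (2.61) $$

We proceed by recursion, working from the right-hand side. (This proof does
not handle cases where some p_i = 0; such details are left to the pedantic
reader.) At the first line we use the definition of convexity (2.60) with
$\lambda = p_1 / \sum_{i=1}^{I} p_i = p_1$; at the second line, $\lambda = p_2 / \sum_{i=2}^{I} p_i$.

$$ f\!\left( \sum_{i=1}^{I} p_i x_i \right) = f\!\left( p_1 x_1 + \sum_{i=2}^{I} p_i x_i \right) $$

$$ \le p_1 f(x_1) + \left[ \sum_{i=2}^{I} p_i \right] \left[ f\!\left( \frac{\sum_{i=2}^{I} p_i x_i}{\sum_{i=2}^{I} p_i} \right) \right] \qquad (2.62) $$

$$ \le p_1 f(x_1) + \left[ \sum_{i=2}^{I} p_i \right] \left[ \frac{p_2}{\sum_{i=2}^{I} p_i} f(x_2) + \frac{\sum_{i=3}^{I} p_i}{\sum_{i=2}^{I} p_i} f\!\left( \frac{\sum_{i=3}^{I} p_i x_i}{\sum_{i=3}^{I} p_i} \right) \right], $$

and so forth. □


Solution to exercise 2.16 (p.36).
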
(a) For the outcomes {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, the probabilities are
P = {1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36}.





(b) The value of one die has mean 3.5 and variance 35/12. So the sum of 
one hundred has mean 350 and variance 3500/12 ~ 292, and by the 
central-limit theorem the probability distribution is roughly Gaussian 
(but confined to the integers), with this mean and variance. 


(c) In order to obtain a sum that has a uniform distribution we have to start 
from random variables some of which have a spiky distribution with the 
probability mass concentrated at the extremes. The unique solution is 
to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0. 


(d) Yes, a uniform distribution can be created in several ways, for example
by labelling the rth die with the numbers {0, 1, 2, 3, 4, 5} × 6^r.
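
The four parts of this solution can be checked exactly with rational arithmetic. The sketch below builds the distribution of a sum of independent dice by repeated convolution; the helper `sum_distribution` is an illustrative name, and the choice to index the dice in part (d) from r = 0 and to show only three of them is an assumption made to keep the example small.

```python
from collections import Counter
from fractions import Fraction

def sum_distribution(dice):
    """Exact distribution of the sum of independent dice.

    Each die is a list of face labels, every face equally likely.
    """
    dist = Counter({0: Fraction(1)})
    for faces in dice:
        new = Counter()
        for total, p in dist.items():
            for face in faces:
                new[total + face] += p * Fraction(1, len(faces))
        dist = new
    return dict(dist)

ordinary = [1, 2, 3, 4, 5, 6]

# (a) two ordinary dice
two_dice = sum_distribution([ordinary, ordinary])

# (b) mean and variance of the sum of one hundred ordinary dice
mean1 = Fraction(sum(ordinary), 6)
var1 = Fraction(sum(x * x for x in ordinary), 6) - mean1**2
mean100, var100 = 100 * mean1, 100 * var1

# (c) one ordinary die and one die with faces 6, 6, 6, 0, 0, 0
uniform_pair = sum_distribution([ordinary, [6, 6, 6, 0, 0, 0]])

# (d) rth die labelled {0,...,5} x 6^r, shown here for r = 0, 1, 2
base6 = sum_distribution([[k * 6**r for k in range(6)] for r in range(3)])

print(two_dice[7], mean100, var100)
print(set(uniform_pair.values()), set(base6.values()))
```

In part (d) each die supplies one base-6 digit of the sum, which is why every total is reached in exactly one way.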
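[Margin note: To think about: does this uniform distribution contradict the
central-limit theorem?]

Solution to exercise 2.17 (p.36).

$$ a = \ln \frac{p}{q} \quad \Rightarrow \quad \frac{p}{q} = e^{a} \qquad (2.63) $$

and q = 1 − p gives

$$ \frac{p}{1 - p} = e^{a} \qquad (2.64) $$

$$ \Rightarrow \quad p = \frac{e^{a}}{e^{a} + 1} = \frac{1}{1 + \exp(-a)}. \qquad (2.65) $$

The hyperbolic tangent is

$$ \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} \qquad (2.66) $$

so

$$ f(a) \equiv \frac{1}{1 + \exp(-a)} = \frac{1}{2} \left( \frac{1 - e^{-a}}{1 + e^{-a}} + 1 \right) = \frac{1}{2} \left( \frac{e^{a/2} - e^{-a/2}}{e^{a/2} + e^{-a/2}} + 1 \right) = \frac{1}{2} \left( \tanh(a/2) + 1 \right). \qquad (2.67) $$

In the case b = log_2 p/q, we can repeat steps (2.63-2.65), replacing e by 2,
to obtain

$$ p = \frac{1}{1 + 2^{-b}}. \qquad (2.68) $$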
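
As a quick numerical sanity check of the solution to exercise 2.17, the sketch below maps a probability to its log odds and back, in both natural and base-2 logarithms, and compares the result with the tanh form. The function names are illustrative.

```python
import math

def p_from_log_odds(a):
    """p = 1 / (1 + exp(-a)), for a = ln(p/q)."""
    return 1.0 / (1.0 + math.exp(-a))

def p_from_log2_odds(b):
    """p = 1 / (1 + 2**(-b)), for b = log2(p/q)."""
    return 1.0 / (1.0 + 2.0 ** (-b))

def p_from_tanh(a):
    """The same function written as (tanh(a/2) + 1) / 2."""
    return 0.5 * (math.tanh(a / 2) + 1)

for p in (0.1, 0.5, 0.9):
    q = 1 - p
    a = math.log(p / q)    # natural-log odds
    b = math.log2(p / q)   # base-2 log odds
    print(p, p_from_log_odds(a), p_from_tanh(a), p_from_log2_odds(b))
```

Each row should print the same probability four times, up to rounding.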
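Solution to exercise 2.18 (p.36).

$$ P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)} \qquad (2.69) $$

$$ \Rightarrow \quad \frac{P(x=1 \mid y)}{P(x=0 \mid y)} = \frac{P(y \mid x=1)}{P(y \mid x=0)} \, \frac{P(x=1)}{P(x=0)} \qquad (2.70) $$

$$ \Rightarrow \quad \log \frac{P(x=1 \mid y)}{P(x=0 \mid y)} = \log \frac{P(y \mid x=1)}{P(y \mid x=0)} + \log \frac{P(x=1)}{P(x=0)}. \qquad (2.71) $$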

Solution to exercise 2.19 (p.36). The conditional independence of d_1 and d_2
given x means

$$ P(x, d_1, d_2) = P(x) P(d_1 \mid x) P(d_2 \mid x). \qquad (2.72) $$

This gives a separation of the posterior probability ratio into a series of factors,
one for each data point, times the prior probability ratio.

$$ \frac{P(x=1 \mid \{d_i\})}{P(x=0 \mid \{d_i\})} = \frac{P(\{d_i\} \mid x=1)}{P(\{d_i\} \mid x=0)} \, \frac{P(x=1)}{P(x=0)} \qquad (2.73) $$

$$ = \frac{P(d_1 \mid x=1)}{P(d_1 \mid x=0)} \, \frac{P(d_2 \mid x=1)}{P(d_2 \mid x=0)} \, \frac{P(x=1)}{P(x=0)}. \qquad (2.74) $$


Life in high-dimensional spaces 


Solution to exercise 2.20 (p.37). The volume of a hypersphere of radius r in
N dimensions is in fact

$$ V(r, N) = \frac{\pi^{N/2}}{(N/2)!} \, r^{N}, \qquad (2.75) $$

but you don't need to know this. For this question all that we need is the
r-dependence, V(r, N) ∝ r^N. So the fractional volume in (r − ε, r) is

$$ \frac{r^{N} - (r - \epsilon)^{N}}{r^{N}} = 1 - \left( 1 - \frac{\epsilon}{r} \right)^{N}. \qquad (2.76) $$

The fractional volumes in the shells for the required cases are:

    N              2       10      1000
    ε/r = 0.01     0.02    0.096   0.99996
    ε/r = 0.5      0.75    0.999   1 − 2^(−1000)

Notice that no matter how small ε is, for large enough N essentially all the
probability mass is in the surface shell of thickness ε.
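
The table entries follow from equation (2.76) alone, so they are easy to regenerate. A minimal sketch, with `shell_fraction` as an illustrative name:

```python
def shell_fraction(eps_over_r, n):
    """Fraction of an N-dimensional sphere's volume within eps of its surface."""
    return 1.0 - (1.0 - eps_over_r) ** n

for ratio in (0.01, 0.5):
    row = [shell_fraction(ratio, n) for n in (2, 10, 1000)]
    print(ratio, row)
```

In double-precision floating point the last case, 1 − 2^(−1000), is indistinguishable from 1, which is itself a fair summary of the result.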


Solution to exercise 2.21 (p.37). p_a = 0.1, p_b = 0.2, p_c = 0.7. f(a) = 10,
f(b) = 5, and f(c) = 10/7.

$$ E[f(x)] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times 10/7 = 3. \qquad (2.77) $$

For each x, f(x) = 1/P(x), so

$$ E[1/P(x)] = E[f(x)] = 3. \qquad (2.78) $$

Solution to exercise 2.22 (p.37). For general X,

$$ E[1/P(x)] = \sum_{x \in A_X} P(x) \, \frac{1}{P(x)} = \sum_{x \in A_X} 1 = |A_X|. \qquad (2.79) $$

Solution to exercise 2.23 (p.37). p_a = 0.1, p_b = 0.2, p_c = 0.7. g(a) = 0, g(b) = 1,
and g(c) = 0.

$$ E[g(x)] = p_b = 0.2. \qquad (2.80) $$

Solution to exercise 2.24 (p.37).

$$ P\left( P(x) \in [0.15, 0.5] \right) = p_b = 0.2. \qquad (2.81) $$

$$ P\left( \left| \log \frac{P(x)}{0.2} \right| > 0.05 \right) = p_a + p_c = 0.8. \qquad (2.82) $$
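
The four solutions above all take an expectation, or a probability, of something that depends on P(x) itself. The sketch below computes each one for the three-outcome ensemble. It reads the second event of exercise 2.24 as |log(P(x)/0.2)| > 0.05, and uses base-2 logarithms; both are assumptions of this sketch, though the base does not affect which outcomes satisfy the inequality here.

```python
import math

P = {"a": 0.1, "b": 0.2, "c": 0.7}
f = {"a": 10, "b": 5, "c": 10 / 7}
g = {"a": 0, "b": 1, "c": 0}

def expectation(func):
    """E[func(x)] under the ensemble P."""
    return sum(P[x] * func(x) for x in P)

def probability(event):
    """Total probability of the outcomes x for which event(x) holds."""
    return sum(P[x] for x in P if event(x))

e_f = expectation(lambda x: f[x])            # exercise 2.21
e_inv_p = expectation(lambda x: 1 / P[x])    # exercises 2.21 and 2.22
e_g = expectation(lambda x: g[x])            # exercise 2.23
p_in_range = probability(lambda x: 0.15 <= P[x] <= 0.5)                    # exercise 2.24
p_log = probability(lambda x: abs(math.log2(P[x] / 0.2)) > 0.05)           # exercise 2.24

print(e_f, e_inv_p, e_g, p_in_range, p_log)
```

Note that `e_inv_p` equals the number of outcomes, 3, whatever the probabilities are, as the solution to exercise 2.22 states.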








Solution to exercise 2.25 (p.37). This type of question can be approached in 
two ways: either by differentiating the function to be maximized, finding the 
maximum, and proving it is a global maximum; this strategy is somewhat 
risky since it is possible for the maximum of a function to be at the boundary 
of the space, at a place where the derivative is not zero. Alternatively, a 
carefully chosen inequality can establish the answer. The second method is 
much neater. 


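Proof by differentiation (not the recommended method). Since it is slightly
easier to differentiate ln 1/p than log_2 1/p, we temporarily define H(X) to be
measured using natural logarithms, thus scaling it down by a factor of log_2 e.

$$ H(X) = \sum_i p_i \ln \frac{1}{p_i} \qquad (2.83) $$

$$ \frac{\partial H(X)}{\partial p_i} = \ln \frac{1}{p_i} - 1 \qquad (2.84) $$

we maximize subject to the constraint $\sum_i p_i = 1$ which can be enforced with
a Lagrange multiplier:

$$ G(\mathbf{p}) \equiv H(X) + \lambda \left( \sum_i p_i - 1 \right) \qquad (2.85) $$

$$ \frac{\partial G(\mathbf{p})}{\partial p_i} = \ln \frac{1}{p_i} - 1 + \lambda. \qquad (2.86) $$

At a maximum,

$$ \ln \frac{1}{p_i} - 1 + \lambda = 0 \qquad (2.87) $$

$$ \Rightarrow \quad \ln \frac{1}{p_i} = 1 - \lambda, \qquad (2.88) $$

so all the p_i are equal. That this extremum is indeed a maximum is established
by finding the curvature:

$$ \frac{\partial^2 G(\mathbf{p})}{\partial p_i \, \partial p_j} = -\frac{1}{p_i} \delta_{ij}, \qquad (2.89) $$

which is negative definite. □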


Proof using Jensen's inequality (recommended method). First a reminder of
the inequality.


If f is a convex ⌣ function and x is a random variable then:

$$ E[f(x)] \ge f(E[x]). $$

If f is strictly convex ⌣ and E[f(x)] = f(E[x]), then the random
variable x is a constant (with probability 1).


The secret of a proof using Jensen's inequality is to choose the right
function and the right random variable. We could define

$$ f(u) = \log \frac{1}{u} = -\log u \qquad (2.90) $$

(which is a convex function) and think of $H(X) = \sum_i p_i \log \frac{1}{p_i}$ as the mean of
f(u) where u = P(x), but this would not get us there: it would give us an
inequality in the wrong direction. If instead we define

$$ u = 1/P(x) \qquad (2.91) $$

then we find:

$$ H(X) = -E[f(1/P(x))] \le -f(E[1/P(x)]); \qquad (2.92) $$

now we know from exercise 2.22 (p.37) that E[1/P(x)] = |A_X|, so

$$ H(X) \le -f(|A_X|) = \log |A_X|. \qquad (2.93) $$

Equality holds only if the random variable u = 1/P(x) is a constant, which
means P(x) is a constant for all x. □





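Solution to exercise 2.26 (p.37).

$$ D_{\mathrm{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}. \qquad (2.94) $$

We prove Gibbs' inequality using Jensen's inequality. Let f(u) = log 1/u and
u = Q(x)/P(x). Then

$$ D_{\mathrm{KL}}(P \| Q) = E[f(Q(x)/P(x))] \qquad (2.95) $$

$$ \ge f\!\left( \sum_x P(x) \frac{Q(x)}{P(x)} \right) = \log \left( \frac{1}{\sum_x Q(x)} \right) = 0, \qquad (2.96) $$

with equality only if u = Q(x)/P(x) is a constant, that is, if Q(x) = P(x). □


Second solution. In the above proof the expectations were with respect to
the probability distribution P(x). A second solution method uses Jensen's
inequality with Q(x) instead. We define f(u) = u log u and let u = P(x)/Q(x).
Then

$$ D_{\mathrm{KL}}(P \| Q) = \sum_x Q(x) \frac{P(x)}{Q(x)} \log \frac{P(x)}{Q(x)} = \sum_x Q(x) f\!\left( \frac{P(x)}{Q(x)} \right) \qquad (2.97) $$

$$ \ge f\!\left( \sum_x Q(x) \frac{P(x)}{Q(x)} \right) = f(1) = 0, \qquad (2.98) $$

with equality only if u = P(x)/Q(x) is a constant, that is, if Q(x) = P(x). □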
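
Gibbs' inequality and the entropy bound of exercise 2.25 are closely related: taking Q to be uniform gives D_KL(P||Q) = log|A_X| − H(X), so the bound H(X) ≤ log|A_X| is Gibbs' inequality in disguise. The sketch below illustrates both numerically; the distribution `q` is an arbitrary example chosen for this illustration, not one from the text.

```python
import math

def entropy(p):
    """H(X) in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.2, 0.7]
q = [0.3, 0.3, 0.4]          # arbitrary comparison distribution (an assumption)
uniform = [1 / 3] * 3

print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
print(entropy(p), math.log2(3), kl_divergence(p, uniform))
```

The two divergences on the first line are both positive but unequal, a reminder that the relative entropy is not symmetric.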
Solution to exercise 2.28 (p.38).

$$ H(X) = H_2(f) + f H_2(g) + (1 - f) H_2(h). \qquad (2.99) $$
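
The decomposed form can be compared against the four-term entropy computed directly. In the sketch below each bias is taken to be the probability of the first option listed in the exercise (f for {0, 1}, g for outcome 0, h for outcome 2); that labelling and the test values are assumptions, and the entropy does not depend on the labelling. The derivative is obtained from the hint dH_2(x)/dx = log((1 − x)/x) and checked by finite differences.

```python
import math

def h2(x):
    """Binary entropy H_2(x) in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return x * math.log2(1 / x) + (1 - x) * math.log2(1 / (1 - x))

def entropy(probs):
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def distribution(f, g, h):
    # Assumed convention: f = P(x in {0,1}), g = P(x=0 | {0,1}), h = P(x=2 | {2,3}).
    return [f * g, f * (1 - g), (1 - f) * h, (1 - f) * (1 - h)]

def decomposed_entropy(f, g, h):
    return h2(f) + f * h2(g) + (1 - f) * h2(h)

f, g, h = 0.3, 0.6, 0.9   # arbitrary test values
direct = entropy(distribution(f, g, h))
decomposed = decomposed_entropy(f, g, h)

# Derivative with respect to f, from the hint, in bits per unit of f.
analytic_slope = math.log2((1 - f) / f) + h2(g) - h2(h)
step = 1e-6
numeric_slope = (decomposed_entropy(f + step, g, h)
                 - decomposed_entropy(f - step, g, h)) / (2 * step)

print(direct, decomposed, analytic_slope, numeric_slope)
```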


Solution to exercise 2.29 (p.38). The probability that there are x − 1 tails and
then one head (so we get the first head on the xth toss) is

$$ P(x) = (1 - f)^{x - 1} f. \qquad (2.100) $$

If the first toss is a tail, the probability distribution for the future looks just
like it did before we made the first toss. Thus we have a recursive expression
for the entropy:

$$ H(X) = H_2(f) + (1 - f) H(X). \qquad (2.101) $$

Rearranging,

$$ H(X) = H_2(f) / f. \qquad (2.102) $$
Solution to exercise 2.34 (p.38). The probability of the number of tails t is

$$ P(t) = \left( \frac{1}{2} \right)^{t} \frac{1}{2} \quad \text{for } t \ge 0. \qquad (2.103) $$

The expected number of heads is 1, by definition of the problem. The expected
number of tails is

$$ E[t] = \sum_{t=0}^{\infty} t \left( \frac{1}{2} \right)^{t} \frac{1}{2}, \qquad (2.104) $$

which may be shown to be 1 in a variety of ways. For example, since the
situation after one tail is thrown is equivalent to the opening situation, we can
write down the recurrence relation

$$ E[t] = \frac{1}{2} \left( 1 + E[t] \right) + \frac{1}{2} \cdot 0 \quad \Rightarrow \quad E[t] = 1. \qquad (2.105) $$

The probability distribution of the 'estimator' f̂ = 1/(1 + t), given that
f = 1/2, is plotted in figure 2.12. The probability of f̂ is simply the probability
of the corresponding value of t.

[Figure 2.12. The probability distribution P(f̂) of the estimator
f̂ = 1/(1 + t), given that f = 1/2.]
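
The distribution of the estimator can be tabulated exactly, since each value of t gives a distinct value 1/(1 + t). A minimal sketch follows; the cut-off of 60 tails is an arbitrary truncation chosen for this example, and leaves out probability mass of only 2^(−61).

```python
from fractions import Fraction

def estimator_distribution(max_tails):
    """Map each estimate 1/(1+t) to P(t) = (1/2)^(t+1), for t = 0..max_tails."""
    return {Fraction(1, 1 + t): Fraction(1, 2 ** (t + 1))
            for t in range(max_tails + 1)}

dist = estimator_distribution(60)

# Recover t from each estimate to check the expected number of tails.
expected_tails = sum((1 / f_hat - 1) * p for f_hat, p in dist.items())

for f_hat in sorted(dist, reverse=True)[:4]:
    print(f_hat, dist[f_hat])
print(float(expected_tails))
```

Half of the probability sits on the estimate 1 (no tails before the first head), and a quarter on the true value 1/2.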


Solution to exercise 2.35 (p.38). 


(a) The mean number of rolls from one six to the next six is six (assuming
we start counting rolls after the first of the two sixes). The probability
that the next six occurs on the rth roll is the probability of not getting
a six for r − 1 rolls multiplied by the probability of then getting a six:

$$ P(r_1 = r) = \left( \frac{5}{6} \right)^{r - 1} \frac{1}{6}, \quad \text{for } r \in \{1, 2, 3, \ldots\}. \qquad (2.106) $$

This probability distribution of the number of rolls, r, may be called an
exponential distribution, since

$$ P(r_1 = r) = e^{-\alpha r} / Z, \qquad (2.107) $$

where α = ln(6/5), and Z is a normalizing constant.

(b) The mean number of rolls from the clock until the next six is six.


(c) The mean number of rolls, going back in time, until the most recent six
is six.
(d) The mean number of rolls from the six before the clock struck to the six
after the clock struck is the sum of the answers to (b) and (c), less one,
that is, eleven.


(e) Rather than explaining the difference between (a) and (d), let me give
another hint. Imagine that the buses in Poissonville arrive independently
at random (a Poisson process), with, on average, one bus every
six minutes. Imagine that passengers turn up at bus-stops at a uniform
rate, and are scooped up by the bus without delay, so the interval
between two buses remains constant. Buses that follow gaps bigger than
six minutes become overcrowded. The passengers' representative complains
that two-thirds of all passengers found themselves on overcrowded
buses. The bus operator claims, 'no, no, only one third of our buses
are overcrowded'. Can both these claims be true?
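
The numbers in parts (a) to (d) can be checked by direct summation. The sketch below assumes, as parts (b) and (c) state, that the forward and backward waits from the clock are independent and each follow the distribution (2.106), so that the total gap is their sum less one. The truncation point `R_MAX` and the function names are choices made for this illustration.

```python
P_SIX = 1 / 6

def p_r1(r, p=P_SIX):
    """P(r_1 = r): the next six arrives on roll r."""
    return (1 - p) ** (r - 1) * p

def p_rtot(r, p=P_SIX):
    """P(r_tot = r), where r_tot = r_back + r_forward - 1 and the two waits
    are independent, each distributed as p_r1 (assumption from parts b, c)."""
    return sum(p_r1(k, p) * p_r1(r + 1 - k, p) for k in range(1, r + 1))

R_MAX = 400  # truncation point; the neglected tail is far below 1e-12

rolls = range(1, R_MAX + 1)
mean_r1 = sum(r * p_r1(r) for r in rolls)
mean_rtot = sum(r * p_rtot(r) for r in rolls)
tail_r1 = sum(p_r1(r) for r in rolls if r > 6)
tail_rtot = sum(p_rtot(r) for r in rolls if r > 6)

print(mean_r1, mean_rtot)   # means of the ordinary gap and the clock-straddling gap
print(tail_r1, tail_rtot)   # probabilities that each gap exceeds six rolls
```

The two tail probabilities bear on the hint in part (e): a gap chosen at random and the gap that happens to contain a randomly chosen moment are not the same random variable.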


Solution to exercise 2.38 (p.39).


Binomial distribution method. From the solution to exercise 1.2,
p_B = 3f^2(1 − f) + f^3.


Sum rule method. The marginal probabilities of the eight values of r are
illustrated by:

$$ P(r = 000) = \tfrac{1}{2} (1 - f)^3 + \tfrac{1}{2} f^3, \qquad (2.108) $$

$$ P(r = 001) = \tfrac{1}{2} f (1 - f)^2 + \tfrac{1}{2} f^2 (1 - f) = \tfrac{1}{2} f (1 - f). \qquad (2.109) $$

The posterior probabilities are represented by

$$ P(s = 1 \mid r = 000) = \frac{f^3}{(1 - f)^3 + f^3} \qquad (2.110) $$

and

$$ P(s = 1 \mid r = 001) = \frac{(1 - f) f^2}{f (1 - f)^2 + f^2 (1 - f)} = f. \qquad (2.111) $$

The probabilities of error in these representative cases are thus

$$ P(\text{error} \mid r = 000) = \frac{f^3}{(1 - f)^3 + f^3} \qquad (2.112) $$

and

$$ P(\text{error} \mid r = 001) = f. \qquad (2.113) $$

Notice that while the average probability of error of R_3 is about 3f^2, the
probability (given r) that any particular bit is wrong is either about f^3
or f.


The average error probability, using the sum rule, is

$$ P(\text{error}) = \sum_r P(r) P(\text{error} \mid r) = 2 \left[ \tfrac{1}{2} (1 - f)^3 + \tfrac{1}{2} f^3 \right] \frac{f^3}{(1 - f)^3 + f^3} + 6 \left[ \tfrac{1}{2} f (1 - f) \right] f. $$

So

$$ P(\text{error}) = f^3 + 3 f^2 (1 - f). $$


Solution to exercise 2.39 (p.40).


The entropy is 9.7 bits per word. 
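
This figure can be reproduced in a few lines, taking the approximation of exercise 2.39 to be p_n = 0.1/n for n from 1 to 12 367 and zero beyond. The sketch also sums the probabilities, since the cut-off is what makes the 1/n law normalizable.

```python
import math

N_WORDS = 12367

# Zipf approximation: p_n = 0.1 / n for the n-th most frequent word.
probs = [0.1 / n for n in range(1, N_WORDS + 1)]

total = sum(probs)
entropy_bits = sum(p * math.log2(1 / p) for p in probs)

print(round(total, 4), round(entropy_bits, 1))
```

The probabilities sum to 1 to within about one part in 10^5, so no renormalization is applied before computing the entropy.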





[Figure 2.13 (solution to exercise 2.35). The probability distribution of the
number of rolls r_1 from one 6 to the next (falling solid line),
P(r_1 = r) = (5/6)^(r−1) (1/6), and the probability distribution (dashed line)
of the number of rolls from the 6 before 1pm to the next 6, r_tot,
P(r_tot = r) = r (5/6)^(r−1) (1/6)^2. The probability P(r_1 > 6) is about 1/3;
the probability P(r_tot > 6) is about 2/3. The mean of r_1 is 6, and the
mean of r_tot is 11.]

[Margin note to the solution of exercise 2.38: in the sum for P(error), the
first two terms are for the cases r = 000 and 111; the remaining 6 are for
the other outcomes, which share the same probability of occurring and
identical error probability, f.]
About Chapter 3


If you are eager to get on to information theory, data compression, and noisy 
channels, you can skip to Chapter 4. Data compression and data modelling 
are intimately connected, however, so you’ll probably want to come back to 
this chapter by the time you get to Chapter 6. Before reading Chapter 3, it 
might be good to look at the following exercises. 


> Exercise 3.1. [p.59] A die is selected at random from two twenty-faced dice
on which the symbols 1-10 are written with nonuniform frequency as
follows.

    Symbol                      1  2  3  4  5  6  7  8  9  10
    Number of faces of die A    6  4  3  2  1  1  1  1  1  0
    Number of faces of die B    3  3  2  2  2  2  2  2  1  1

The randomly chosen die is rolled 7 times, with the following outcomes:
5, 3, 9, 3, 8, 4, 7.

What is the probability that the die is die A?

> Exercise 3.2. [p.59] Assume that there is a third twenty-faced die, die C, on
which the symbols 1-20 are written once each. As above, one of the
three dice is selected at random and rolled 7 times, giving the outcomes:
3, 5, 4, 8, 3, 9, 7.
What is the probability that the die is (a) die A, (b) die B, (c) die C?
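
Exercises 3.1 and 3.2 are direct applications of Bayes' theorem with a discrete hypothesis space. The sketch below works them with exact fractions, assuming that "selected at random" means each die has equal prior probability; the names `FACES`, `likelihood` and `posterior` are illustrative.

```python
from fractions import Fraction

# Number of faces showing each symbol 1..10 (from the table in exercise 3.1).
FACES = {
    "A": [6, 4, 3, 2, 1, 1, 1, 1, 1, 0],
    "B": [3, 3, 2, 2, 2, 2, 2, 2, 1, 1],
}
# Die C carries each of the symbols 1..20 exactly once.
FACES_WITH_C = {**FACES, "C": [1] * 20}

def likelihood(counts, outcomes):
    """P(outcomes | die) for independent rolls of a twenty-faced die."""
    prob = Fraction(1)
    for symbol in outcomes:
        prob *= Fraction(counts[symbol - 1], 20)
    return prob

def posterior(dice, outcomes):
    """Posterior over dice, assuming equal prior probability for each die."""
    joint = {name: Fraction(1, len(dice)) * likelihood(counts, outcomes)
             for name, counts in dice.items()}
    evidence = sum(joint.values())
    return {name: j / evidence for name, j in joint.items()}

two = posterior(FACES, [5, 3, 9, 3, 8, 4, 7])            # exercise 3.1
three = posterior(FACES_WITH_C, [3, 5, 4, 8, 3, 9, 7])   # exercise 3.2

print(two)
print(three)
```

The order of the outcomes does not matter: the likelihood depends only on how many times each symbol appears.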

> Exercise 3.3. [p.48] Inferring a decay constant

Unstable particles are emitted from a source and decay at a distance
x, a real number that has an exponential probability distribution with
characteristic length λ. Decay events can be observed only if they occur
in a window extending from x = 1 cm to x = 20 cm. N decays are
observed at locations {x_1, ..., x_N}. What is λ?

> Exercise 3.4. [p.55] Forensic evidence


Two people have left traces of their own blood at the scene of a crime. A
suspect, Oliver, is tested and found to have type 'O' blood. The blood
groups of the two traces are found to be of type 'O' (a common type
in the local population, having frequency 60%) and of type 'AB' (a rare
type, with frequency 1%). Do these data (type 'O' and 'AB' blood were
found at scene) give evidence in favour of the proposition that Oliver
was one of the two people present at the crime?
More about Inference


It is not a controversial statement that Bayes’ theorem provides the correct 
language for describing the inference of a message communicated over a noisy 
channel, as we used it in Chapter 1 (p.6). But strangely, when it comes to 
other inference problems, the use of Bayes’ theorem is not so widespread. 


▶ 3.1 A first inference problem


When I was an undergraduate in Cambridge, I was privileged to receive su- 
pervisions from Steve Gull. Sitting at his desk in a dishevelled office in St. 
John’s College, I asked him how one ought to answer an old Tripos question 
(exercise 3.3): 


    Unstable particles are emitted from a source and decay at a
    distance x, a real number that has an exponential probability
    distribution with characteristic length λ. Decay events can be
    observed only if they occur in a window extending from x = 1 cm
    to x = 20 cm. N decays are observed at locations {x_1, ..., x_N}.
    What is λ?

I had scratched my head over this for some time. My education had provided 
me with a couple of approaches to solving such inference problems: construct- 
ing ‘estimators’ of the unknown parameters; or ‘fitting’ the model to the data, 
or to a processed version of the data. 





Since the mean of an unconstrained exponential distribution is λ, it seemed
reasonable to examine the sample mean x̄ = Σ_n x_n/N and see if an estimator
λ̂ could be obtained from it. It was evident that the estimator λ̂ = x̄ − 1 would
be appropriate for λ ≪ 20 cm, but not for cases where the truncation of the
distribution at the right-hand side is significant; with a little ingenuity and
the introduction of ad hoc bins, promising estimators for λ ≫ 20 cm could be
constructed. But there was no obvious estimator that would work under all
conditions.

Nor could I find a satisfactory approach based on fitting the density P(x | λ)
to a histogram derived from the data. I was stuck.

What is the general solution to this problem and others like it? Is it 
always necessary, when confronted by a new inference problem, to grope in the 
dark for appropriate ‘estimators’ and worry about finding the ‘best’ estimator 
(whatever that means)? 


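Steve wrote down the probability of one data point, given λ:

$$ P(x \mid \lambda) = \begin{cases} \dfrac{1}{\lambda} \, e^{-x/\lambda} / Z(\lambda) & 1 < x < 20 \\ 0 & \text{otherwise} \end{cases} \qquad (3.1) $$

where

$$ Z(\lambda) = \int_{1}^{20} dx \, \frac{1}{\lambda} \, e^{-x/\lambda} = \left( e^{-1/\lambda} - e^{-20/\lambda} \right). \qquad (3.2) $$

This seemed obvious enough. Then he wrote Bayes' theorem:

$$ P(\lambda \mid \{x_1, \ldots, x_N\}) = \frac{P(\{x\} \mid \lambda) P(\lambda)}{P(\{x\})} \qquad (3.3) $$

$$ \propto \frac{1}{(\lambda Z(\lambda))^{N}} \exp\!\left( -\sum_{n=1}^{N} x_n / \lambda \right) P(\lambda). \qquad (3.4) $$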


Suddenly, the straightforward distribution P({x_1, ..., x_N} | λ), defining the
probability of the data given the hypothesis λ, was being turned on its head
so as to define the probability of a hypothesis given the data. A simple figure
showed the probability of a single data point P(x | λ) as a familiar function of x,
for different values of λ (figure 3.1). Each curve was an innocent exponential,
normalized to have area 1. Plotting the same function as a function of λ for a
fixed value of x, something remarkable happens: a peak emerges (figure 3.2).
To help understand these two points of view of the one function, figure 3.3
shows a surface plot of P(x | λ) as a function of x and λ.

For a dataset consisting of several points, e.g., the six points {x_n}, n = 1, ..., N,
equal to {1.5, 2, 3, 4, 5, 12}, the likelihood function P({x} | λ) is the product of the N
functions of λ, P(x_n | λ) (figure 3.4).
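
The likelihood for the six-point dataset is simple to evaluate on a grid, which is one way to see the peak without any plotting. The sketch below follows the truncated-exponential density and its normalizer for the window from 1 cm to 20 cm; the grid of λ values from 1 to 100 in steps of 0.1 is an arbitrary choice for this illustration.

```python
import math

def z(lam, lo=1.0, hi=20.0):
    """Normalizer Z(lambda) of the exponential truncated to the window [lo, hi]."""
    return math.exp(-lo / lam) - math.exp(-hi / lam)

def likelihood(lam, data):
    """P({x} | lambda): product over data points of exp(-x/lambda) / (lambda Z)."""
    n = len(data)
    return math.exp(-sum(data) / lam) / (lam * z(lam)) ** n

data = [1.5, 2, 3, 4, 5, 12]

# Assumed grid: lambda from 1 to 100 in steps of 0.1.
grid = [0.1 * k for k in range(10, 1001)]
values = [likelihood(lam, data) for lam in grid]

best = max(range(len(grid)), key=values.__getitem__)
lam_peak, peak = grid[best], values[best]
print(lam_peak, peak)
```

Multiplying `values` by a prior evaluated on the same grid and normalizing would give a discretized posterior over λ.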


Figure 3.1. The probability
density P(x | λ) as a function of x.


Figure 3.2. The probability
density P(x | λ) as a function of λ,
for three different values of x.
When plotted this way round, the
function is known as the likelihood
of λ. The marks indicate the
three values of λ, λ = 2, 5, 10, that
were used in the preceding figure.


Figure 3.3. The probability
density P(x | λ) as a function of x
and λ. Figures 3.1 and 3.2 are
vertical sections through this
surface.


Figure 3.4. The likelihood function
in the case of a six-point dataset,
P({x} = {1.5, 2, 3, 4, 5, 12} | λ), as
a function of λ.


Steve summarized Bayes' theorem as embodying the fact that

    what you know about λ after the data arrive is what you knew
    before [P(λ)], and what the data told you [P({x} | λ)].

Probabilities are used here to quantify degrees of belief. To nip possible
confusion in the bud, it must be emphasized that the hypothesis λ that
correctly describes the situation is not a stochastic variable, and the fact that the
Bayesian uses a probability distribution P does not mean that he thinks of
the world as stochastically changing its nature between the states described
by the different hypotheses. He uses the notation of probabilities to represent
his beliefs about the mutually exclusive micro-hypotheses (here, values of λ),
of which only one is actually true. That probabilities can denote degrees of
belief, given assumptions, seemed reasonable to me.

The posterior probability distribution (3.4) represents the unique and
complete solution to the problem. There is no need to invent 'estimators'; nor do
we need to invent criteria for comparing alternative estimators with each other.
Whereas orthodox statisticians offer twenty ways of solving a problem, and
another twenty different criteria for deciding which of these solutions is the best,
Bayesian statistics only offers one answer to a well-posed problem.

[Margin note: If you have any difficulty understanding this chapter I
recommend ensuring you are happy with exercises 3.1 and 3.2 (p.47) then
noting their similarity to exercise 3.3.]


Assumptions in inference


Our inference is conditional on our assumptions [for example, the prior P(λ)].
Critics view such priors as a difficulty because they are 'subjective', but I don't
see how it could be otherwise. How can one perform inference without making
assumptions? I believe that it is of great value that Bayesian methods force
one to make these tacit assumptions explicit.

First, once assumptions are made, the inferences are objective and unique,
reproducible with complete agreement by anyone who has the same
information and makes the same assumptions. For example, given the assumptions
listed above, H, and the data D, everyone will agree about the posterior
probability of the decay length λ:

$$ P(\lambda \mid D, H) = \frac{P(D \mid \lambda, H) P(\lambda \mid H)}{P(D \mid H)}. \qquad (3.5) $$

Second, when the assumptions are explicit, they are easier to criticize, and
easier to modify; indeed, we can quantify the sensitivity of our inferences to
the details of the assumptions. For example, we can note from the likelihood
curves in figure 3.2 that in the case of a single data point at x = 5, the
likelihood function is less strongly peaked than in the case x = 3; the details
of the prior P(λ) become increasingly important as the sample mean x̄ gets
closer to the middle of the window, 10.5. In the case x = 12, the likelihood
function doesn’t have a peak at all: such data merely rule out small values
of λ, and don’t give any information about the relative probabilities of large
values of λ. So in this case, the details of the prior at the small-λ end of things
are not important, but at the large-λ end, the prior is important.
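The behaviour of these likelihood curves can be checked numerically. The sketch below assumes the model stated earlier in the chapter for the decay problem, namely a single datum x observed in the window 1 < x < 20 with P(x | λ) = exp(−x/λ) / (λ Z(λ)) and Z(λ) = exp(−1/λ) − exp(−20/λ); the grid of λ values and the function names are illustrative choices only.

```python
import math

X_MIN, X_MAX = 1.0, 20.0  # observation window (assumed from the chapter's setup)

def likelihood(x, lam):
    """P(x | lambda) for an exponential density truncated to (X_MIN, X_MAX)."""
    z = math.exp(-X_MIN / lam) - math.exp(-X_MAX / lam)  # normalizer Z(lambda)
    return math.exp(-x / lam) / (lam * z)

def peak_location(x, grid):
    """Return the grid value of lambda that maximizes the likelihood of one datum x."""
    return max(grid, key=lambda lam: likelihood(x, lam))

# Log-spaced grid of decay lengths from 0.5 up to roughly 1000 (arbitrary choice).
GRID = [0.5 * 10 ** (k / 50) for k in range(166)]

for x in (3.0, 5.0, 12.0):
    print(x, peak_location(x, GRID))
```

For x = 3 and x = 5 the maximum sits at a moderate value of λ inside the grid, whereas for x = 12 the maximizer is the largest λ on the grid, however far the grid is extended: the likelihood keeps rising and has no peak.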

Third, when we are not sure which of various alternative assumptions is
the most appropriate for a problem, we can treat this question as another
inference task. Thus, given data D, we can compare alternative assumptions
H using Bayes’ theorem:

$$
P(\mathcal{H} \mid D, I) = \frac{P(D \mid \mathcal{H}, I)\, P(\mathcal{H} \mid I)}{P(D \mid I)}, \tag{3.6}
$$

where I denotes the highest assumptions, which we are not questioning.

Fourth, we can take into account our uncertainty regarding such assump- 
tions when we make subsequent predictions. Rather than choosing one partic- 
ular assumption H*, and working out our predictions about some quantity t, 
P(t|D,H*,I), we obtain predictions that take into account our uncertainty 
about H by using the sum rule: 


$$
P(t \mid D, I) = \sum_{\mathcal{H}} P(t \mid D, \mathcal{H}, I)\, P(\mathcal{H} \mid D, I). \tag{3.7}
$$


This is another contrast with orthodox statistics, in which it is conventional 
to ‘test’ a default model, and then, if the test ‘accepts the model’ at some 
‘significance level’, to use exclusively that model to make predictions. 

Steve thus persuaded me that 


probability theory reaches parts that ad hoc methods cannot reach. 


Let’s look at a few more examples of simple inference problems. 


> 3.2 The bent coin 


A bent coin is tossed F times; we observe a sequence s of heads and tails 
(which we’ll denote by the symbols a and b). We wish to know the bias of 
the coin, and predict the probability that the next toss will result in a head. 
We first encountered this task in example 2.7 (p.30), and we will encounter it 
again in Chapter 6, when we discuss adaptive data compression. It is also the 
original inference problem studied by Thomas Bayes in his essay published in 
1763. 

As in exercise 2.8 (p.30), we will assume a uniform prior distribution and 
obtain a posterior distribution by multiplying by the likelihood. A critic might 
object, ‘where did this prior come from?’ I will not claim that the uniform 
prior is in any way fundamental; indeed we’ll give examples of nonuniform 
priors later. The prior is a subjective assumption. One of the themes of this 
book is: 


you can’t do inference — or data compression — without making 


assumptions. 





We give the name H1 to our assumptions. [We’ll be introducing an
alternative set of assumptions in a moment.] The probability, given pa, that F
tosses result in a sequence s that contains {Fa, Fb} counts of the two outcomes
is

$$
P(s \mid p_a, F, \mathcal{H}_1) = p_a^{F_a} (1 - p_a)^{F_b}. \tag{3.8}
$$

[For example, P(s = aaba | pa, F = 4, H1) = pa pa (1 − pa) pa.] Our first model
assumes a uniform prior distribution for pa,

$$
P(p_a \mid \mathcal{H}_1) = 1, \qquad p_a \in [0, 1], \tag{3.9}
$$

and pb = 1 − pa.


Inferring unknown parameters 


Given a string of length F of which Fa are as and Fb are bs, we are interested
in (a) inferring what pa might be; (b) predicting whether the next character is
an a or a b. [Predictions are always expressed as probabilities. So ‘predicting
whether the next character is an a’ is the same as computing the probability
that the next character is an a.]

Assuming H1 to be true, the posterior probability of pa, given a string s
of length F that has counts {Fa, Fb}, is, by Bayes’ theorem,

$$
P(p_a \mid s, F, \mathcal{H}_1) = \frac{P(s \mid p_a, F, \mathcal{H}_1)\, P(p_a \mid \mathcal{H}_1)}{P(s \mid F, \mathcal{H}_1)}. \tag{3.10}
$$

The factor P(s | pa, F, H1), which, as a function of pa, is known as the
likelihood function, was given in equation (3.8); the prior P(pa | H1) was given in
equation (3.9). Our inference of pa is thus:

$$
P(p_a \mid s, F, \mathcal{H}_1) = \frac{p_a^{F_a} (1 - p_a)^{F_b}}{P(s \mid F, \mathcal{H}_1)}. \tag{3.11}
$$

The normalizing constant is given by the beta integral

$$
P(s \mid F, \mathcal{H}_1) = \int_0^1 \mathrm{d}p_a\; p_a^{F_a} (1 - p_a)^{F_b}
 = \frac{\Gamma(F_a + 1)\,\Gamma(F_b + 1)}{\Gamma(F_a + F_b + 2)}
 = \frac{F_a!\, F_b!}{(F_a + F_b + 1)!}. \tag{3.12}
$$
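As a numerical sanity check on the beta integral (3.12), the following minimal Python sketch compares the closed form Fa! Fb! / (Fa + Fb + 1)! with a midpoint-rule estimate of the integral, and evaluates the posterior density (3.11). The function names and the number of integration steps are illustrative choices, not part of the text.

```python
from math import factorial

def evidence_h1(f_a, f_b):
    """Closed form of equation (3.12): Fa! Fb! / (Fa + Fb + 1)!."""
    return factorial(f_a) * factorial(f_b) / factorial(f_a + f_b + 1)

def evidence_h1_numeric(f_a, f_b, steps=100_000):
    """Midpoint-rule estimate of the integral of pa^Fa (1 - pa)^Fb over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += p ** f_a * (1.0 - p) ** f_b
    return h * total

def posterior_density(p_a, f_a, f_b):
    """Posterior density of equation (3.11) under the uniform prior (3.9)."""
    return p_a ** f_a * (1.0 - p_a) ** f_b / evidence_h1(f_a, f_b)

# Example: the string s = aba has Fa = 2, Fb = 1.
print(evidence_h1(2, 1), evidence_h1_numeric(2, 1))
```

For s = aba the closed form gives 2! 1! / 4! = 1/12, and the numerical integral agrees to many decimal places.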


> Exercise 3.5.[2, p.59] Sketch the posterior probability P(pa | s = aba, F = 3).
What is the most probable value of pa (i.e., the value that maximizes
the posterior probability density)? What is the mean value of pa under
this distribution?

Answer the same questions for the posterior probability
P(pa | s = bbb, F = 3).

From inferences to predictions


Our prediction about the next toss, the probability that the next toss is an a, 
is obtained by integrating over pa. This has the effect of taking into account 
our uncertainty about pa when making predictions. By the sum rule, 


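$$
P(a \mid s, F) = \int \mathrm{d}p_a\; P(a \mid p_a)\, P(p_a \mid s, F). \tag{3.13}
$$

The probability of an a given pa is simply pa, so

\begin{align}
P(a \mid s, F) &= \int \mathrm{d}p_a\; p_a\, \frac{p_a^{F_a} (1 - p_a)^{F_b}}{P(s \mid F)} \tag{3.14} \\
 &= \int \mathrm{d}p_a\; \frac{p_a^{F_a + 1} (1 - p_a)^{F_b}}{P(s \mid F)} \tag{3.15} \\
 &= \left[ \frac{(F_a + 1)!\, F_b!}{(F_a + F_b + 2)!} \right] \bigg/ \left[ \frac{F_a!\, F_b!}{(F_a + F_b + 1)!} \right]
 = \frac{F_a + 1}{F_a + F_b + 2}, \tag{3.16}
\end{align}

which is known as Laplace’s rule.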
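Laplace’s rule is simple enough to state as a one-line function. The sketch below uses exact rational arithmetic; the function name is an illustrative choice.

```python
from fractions import Fraction

def laplace_rule(f_a, f_b):
    """Predictive probability that the next toss is an 'a', given counts (Fa, Fb)
    and a uniform prior on pa: (Fa + 1) / (Fa + Fb + 2)."""
    return Fraction(f_a + 1, f_a + f_b + 2)

print(laplace_rule(0, 0))  # no data yet
print(laplace_rule(2, 1))  # after observing s = aba
print(laplace_rule(0, 3))  # after observing s = bbb
```

With no data the prediction is 1/2; after aba it is 3/5; after bbb it is 1/5. Note that even after three tails in a row the predicted probability of a head is not zero, unlike the raw frequency Fa/F.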


> 3.3 The bent coin and model comparison 


Imagine that a scientist introduces another theory for our data. He asserts
that the source is not really a bent coin but is really a perfectly formed die with
one face painted heads (‘a’) and the other five painted tails (‘b’). Thus the
parameter pa, which in the original model, H1, could take any value between
0 and 1, is according to the new hypothesis, H0, not a free parameter at all;
rather, it is equal to 1/6. [This hypothesis is termed H0 so that the suffix of
each model indicates its number of free parameters.]

How can we compare these two models in the light of data? We wish to
infer how probable H1 is relative to H0.

Model comparison as inference


In order to perform model comparison, we write down Bayes’ theorem again,
but this time with a different argument on the left-hand side. We wish to
know how probable H1 is given the data. By Bayes’ theorem,

$$
P(\mathcal{H}_1 \mid s, F) = \frac{P(s \mid F, \mathcal{H}_1)\, P(\mathcal{H}_1)}{P(s \mid F)}. \tag{3.17}
$$

Similarly, the posterior probability of H0 is

$$
P(\mathcal{H}_0 \mid s, F) = \frac{P(s \mid F, \mathcal{H}_0)\, P(\mathcal{H}_0)}{P(s \mid F)}. \tag{3.18}
$$

The normalizing constant in both cases is P(s | F), which is the total
probability of getting the observed data. If H1 and H0 are the only models under
consideration, this probability is given by the sum rule:

$$
P(s \mid F) = P(s \mid F, \mathcal{H}_1)\, P(\mathcal{H}_1) + P(s \mid F, \mathcal{H}_0)\, P(\mathcal{H}_0). \tag{3.19}
$$

To evaluate the posterior probabilities of the hypotheses we need to assign
values to the prior probabilities P(H1) and P(H0); in this case, we might
set these to 1/2 each. And we need to evaluate the data-dependent terms
P(s | F, H1) and P(s | F, H0). We can give names to these quantities. The
quantity P(s | F, H1) is a measure of how much the data favour H1, and we
call it the evidence for model H1. We already encountered this quantity in
equation (3.10) where it appeared as the normalizing constant of the first
inference we made, the inference of pa given the data.





How model comparison works: The evidence for a model is 


usually the normalizing constant of an earlier Bayesian inference. 





We evaluated the normalizing constant for model H1 in (3.12). The evidence
for model H0 is very simple because this model has no parameters to
infer. Defining p0 to be 1/6, we have

$$
P(s \mid F, \mathcal{H}_0) = p_0^{F_a} (1 - p_0)^{F_b}. \tag{3.20}
$$

Thus the posterior probability ratio of model H1 to model H0 is

\begin{align}
\frac{P(\mathcal{H}_1 \mid s, F)}{P(\mathcal{H}_0 \mid s, F)}
 &= \frac{P(s \mid F, \mathcal{H}_1)\, P(\mathcal{H}_1)}{P(s \mid F, \mathcal{H}_0)\, P(\mathcal{H}_0)} \tag{3.21} \\
 &= \frac{F_a!\, F_b!}{(F_a + F_b + 1)!} \bigg/ p_0^{F_a} (1 - p_0)^{F_b}. \tag{3.22}
\end{align}

Some values of this posterior probability ratio are illustrated in table 3.5. The
first five lines illustrate that some outcomes favour one model, and some favour
the other. No outcome is completely incompatible with either model. With
small amounts of data (six tosses, say) it is typically not the case that one of
the two models is overwhelmingly more probable than the other. But with
more data, the evidence against H0 given by any data set with the ratio Fa : Fb
differing from 1 : 5 mounts up. You can’t predict in advance how much data
are needed to be pretty sure which theory is true. It depends what pa is.

The simpler model, H0, since it has no adjustable parameters, is able to
lose out by the biggest margin. The odds may be hundreds to one against it.
The more complex model can never lose out by a large margin; there’s no data
set that is actually unlikely given model H1.
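The posterior ratio (3.21, 3.22) is easy to evaluate exactly. The sketch below is one way to do so in Python with rational arithmetic; the function names are illustrative, and the default prior of 1/2 for each model is the choice suggested in the text (with equal priors the ratio reduces to the evidence ratio of equation 3.22).

```python
from fractions import Fraction
from math import factorial

P0 = Fraction(1, 6)  # the fixed value of pa asserted by model H0

def evidence_h1(f_a, f_b):
    """P(s | F, H1) from equation (3.12)."""
    return Fraction(factorial(f_a) * factorial(f_b), factorial(f_a + f_b + 1))

def evidence_h0(f_a, f_b):
    """P(s | F, H0) from equation (3.20)."""
    return P0 ** f_a * (1 - P0) ** f_b

def posterior_ratio(f_a, f_b, prior_h1=Fraction(1, 2)):
    """P(H1 | s, F) / P(H0 | s, F), equation (3.21)."""
    prior_h0 = 1 - prior_h1
    return (evidence_h1(f_a, f_b) * prior_h1) / (evidence_h0(f_a, f_b) * prior_h0)

# The data sets listed in table 3.5.
for counts in [(5, 1), (3, 3), (2, 4), (1, 5), (0, 6), (10, 10), (3, 17), (0, 20)]:
    print(sum(counts), counts, float(posterior_ratio(*counts)))
```

Run on these eight data sets, the function gives ratios that round to the values tabulated in table 3.5.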
Table 3.5. Outcome of model comparison between models H1 and H0 for the
‘bent coin’. Model H0 states that pa = 1/6, pb = 5/6.

  F     Data (Fa, Fb)     P(H1 | s, F) / P(H0 | s, F)
  6     (5, 1)            222.2
  6     (3, 3)            2.67
  6     (2, 4)            0.71  = 1/1.4
  6     (1, 5)            0.356 = 1/2.8
  6     (0, 6)            0.427 = 1/2.3
  20    (10, 10)          96.5
  20    (3, 17)           0.2   = 1/5
  20    (0, 20)           1.83

> Exercise 3.6.[2] Show that after F tosses have taken place, the biggest value
that the log evidence ratio

$$
\log \frac{P(s \mid F, \mathcal{H}_1)}{P(s \mid F, \mathcal{H}_0)} \tag{3.23}
$$

can have scales linearly with F if H1 is more probable, but the log
evidence in favour of H0 can grow at most as log F.

> Exercise 3.7.[3, p.60] Putting your sampling theory hat on, assuming Fa has
not yet been measured, compute a plausible range that the log evidence
ratio might lie in, as a function of F and the true value of pa, and sketch
it as a function of F for pa = p0 = 1/6, pa = 0.25, and pa = 1/2. [Hint:
sketch the log evidence as a function of the random variable Fa and work
out the mean and standard deviation of Fa.]

Typical behaviour of the evidence

Figure 3.6 shows the log evidence ratio as a function of the number of tosses,
F, in a number of simulated experiments. In the left-hand experiments, H0
was true. In the right-hand ones, H1 was true, and the value of pa was either
0.25 or 0.5.

We will discuss model comparison more in a later chapter.

[Figure 3.6. Typical behaviour of the evidence in favour of H1 as bent coin
tosses accumulate under three different conditions (columns 1, 2, 3): H0 is
true, pa = 1/6; H1 is true, pa = 0.25; H1 is true, pa = 0.5. Horizontal axis
is the number of tosses, F. The vertical axis on the left is
ln P(s | F, H1)/P(s | F, H0); the right-hand vertical axis shows the values of
P(s | F, H1)/P(s | F, H0). The three rows show independent simulated
experiments. (See also figure 3.8, p.60.)]
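A simulated experiment of the kind shown in figure 3.6 can be sketched as follows. This is an illustration only, not the code used to produce the figure: the seed, the number of tosses and the function names are assumptions. Factorials are computed through the log-gamma function so that the log evidence ratio stays finite for large F.

```python
import math
import random

P0 = 1.0 / 6.0  # value of pa under model H0

def log_evidence_ratio(f_a, f_b):
    """ln [P(s | F, H1) / P(s | F, H0)], from equations (3.12) and (3.20)."""
    log_ev_h1 = math.lgamma(f_a + 1) + math.lgamma(f_b + 1) - math.lgamma(f_a + f_b + 2)
    log_ev_h0 = f_a * math.log(P0) + f_b * math.log(1.0 - P0)
    return log_ev_h1 - log_ev_h0

def simulate(p_a, n_tosses=200, seed=0):
    """Toss a coin with bias p_a and record the log evidence ratio after each toss."""
    rng = random.Random(seed)
    f_a = f_b = 0
    trace = []
    for _ in range(n_tosses):
        if rng.random() < p_a:
            f_a += 1
        else:
            f_b += 1
        trace.append(log_evidence_ratio(f_a, f_b))
    return trace

for p_a in (1.0 / 6.0, 0.25, 0.5):
    print(p_a, simulate(p_a)[-1])
```

Plotting each trace against the toss number should give curves of the same character as the panels of figure 3.6: a slow drift in favour of H0 when pa = 1/6, and a roughly linear climb in favour of H1 when pa = 0.5.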


> 3.4 An example of legal evidence 


The following example illustrates that there is more to Bayesian inference than 
the priors. 


Two people have left traces of their own blood at the scene of a 
crime. A suspect, Oliver, is tested and found to have type ‘O’ 
blood. The blood groups of the two traces are found to be of type 
‘O’ (a common type in the local population, having frequency 60%) 
and of type ‘AB’ (a rare type, with frequency 1%). Do these data 
(type ‘O’ and ‘AB’ blood were found at scene) give evidence in 
favour of the proposition that Oliver was one of the two people 
present at the crime? 


A careless lawyer might claim that the fact that the suspect’s blood type was 
found at the scene is positive evidence for the theory that he was present. But 
this is not so. 

Denote the proposition ‘the suspect and one unknown person were present’
by S. The alternative, S̄, states ‘two unknown people from the population were
present’. The prior in this problem is the prior probability ratio between the
propositions S and S̄. This quantity is important to the final verdict and
would be based on all other available information in the case. Our task here is
just to evaluate the contribution made by the data D, that is, the likelihood
ratio, P(D | S, H)/P(D | S̄, H). In my view, a jury’s task should generally be to
multiply together carefully evaluated likelihood ratios from each independent
piece of admissible evidence with an equally carefully reasoned prior
probability. [This view is shared by many statisticians but learned British appeal
judges recently disagreed and actually overturned the verdict of a trial because
the jurors had been taught to use Bayes’ theorem to handle complicated DNA
evidence.]

The probability of the data given S is the probability that one unknown
person drawn from the population has blood type AB:

$$
P(D \mid S, \mathcal{H}) = p_{\rm AB} \tag{3.24}
$$

(since given S, we already know that one trace will be of type O). The
probability of the data given S̄ is the probability that two unknown people drawn
from the population have types O and AB:

$$
P(D \mid \bar{S}, \mathcal{H}) = 2\, p_{\rm O}\, p_{\rm AB}. \tag{3.25}
$$

In these equations H denotes the assumptions that two people were present
and left blood there, and that the probability distribution of the blood groups
of unknown people in an explanation is the same as the population frequencies.
Dividing, we obtain the likelihood ratio:

$$
\frac{P(D \mid S, \mathcal{H})}{P(D \mid \bar{S}, \mathcal{H})} = \frac{1}{2 p_{\rm O}} = \frac{1}{2 \times 0.6} = 0.83. \tag{3.26}
$$

Thus the data in fact provide weak evidence against the supposition that
Oliver was present.

This result may be found surprising, so let us examine it from various
points of view. First consider the case of another suspect, Alberto, who has
type AB. Intuitively, the data do provide evidence in favour of the theory S′
that this suspect was present, relative to the null hypothesis S̄. And indeed
the likelihood ratio in this case is:

$$
\frac{P(D \mid S', \mathcal{H})}{P(D \mid \bar{S}, \mathcal{H})} = \frac{1}{2 p_{\rm AB}} = 50. \tag{3.27}
$$
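The two likelihood ratios for Oliver and Alberto can be reproduced in a few lines. The sketch below follows the two-trace argument above (one trace of the suspect’s type, one of the other type); the function and variable names are illustrative.

```python
P_O = 0.60   # population frequency of type O
P_AB = 0.01  # population frequency of type AB

def likelihood_ratio(p_suspect_type, p_other_type):
    """P(D | suspect present) / P(D | two unknowns) when the two traces are of
    the suspect's type and of one other type."""
    # Suspect present: his trace is accounted for, so one unknown person
    # must supply the other type.
    p_data_present = p_other_type
    # Suspect absent: two unknown people supply the two types, in either order.
    p_data_absent = 2.0 * p_suspect_type * p_other_type
    return p_data_present / p_data_absent

print("Oliver (type O):  ", likelihood_ratio(P_O, P_AB))
print("Alberto (type AB):", likelihood_ratio(P_AB, P_O))
```

The frequency of the other type cancels, leaving 1/(2 × 0.6) ≈ 0.83 for Oliver and 1/(2 × 0.01) = 50 for Alberto: the same data count weakly against the common type and strongly in favour of the rare one.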





Now let us change the situation slightly; imagine that 99% of people are of 
blood type O, and the rest are of type AB. Only these two blood types exist 
in the population. The data at the scene are the same as before. Consider 
again how these data influence our beliefs about Oliver, a suspect of type 
O, and Alberto, a suspect of type AB. Intuitively, we still believe that the 
presence of the rare AB blood provides positive evidence that Alberto was 
there. But does the fact that type O blood was detected at the scene favour 
the hypothesis that Oliver was present? If this were the case, that would mean 
that regardless of who the suspect is, the data make it more probable they were 
present; everyone in the population would be under greater suspicion, which 
would be absurd. The data may be compatible with any suspect of either 
blood type being present, but if they provide evidence for some theories, they 
must also provide evidence against other theories. 

Here is another way of thinking about this: imagine that instead of two 
people’s blood stains there are ten, and that in the entire local population 
of one hundred, there are ninety type O suspects and ten type AB suspects. 
Consider a particular type O suspect, Oliver: without any other information, 
and before the blood test results come in, there is a one in 10 chance that he 
was at the scene, since we know that 10 out of the 100 suspects were present. 
We now get the results of blood tests, and find that nine of the ten stains are 
of type AB, and one of the stains is of type O. Does this make it more likely 
that Oliver was there? No, there is now only a one in ninety chance that he 
was there, since we know that only one person present was of type O. 

Maybe the intuition is aided finally by writing down the formulae for the
general case where nO blood stains of individuals of type O are found, and
nAB of type AB, a total of N individuals in all, and unknown people come
from a large population with fractions pO, pAB. (There may be other blood
types too.) The task is to evaluate the likelihood ratio for the two hypotheses:
S, ‘the type O suspect (Oliver) and N−1 unknown others left N stains’; and
S̄, ‘N unknowns left N stains’. The probability of the data under hypothesis
S̄ is just the probability of getting nO, nAB individuals of the two types when
N individuals are drawn at random from the population:

$$
P(n_{\rm O}, n_{\rm AB} \mid \bar{S}) = \frac{N!}{n_{\rm O}!\, n_{\rm AB}!}\, p_{\rm O}^{n_{\rm O}}\, p_{\rm AB}^{n_{\rm AB}}. \tag{3.28}
$$

In the case of hypothesis S, we need the distribution of the N−1 other
individuals:

$$
P(n_{\rm O}, n_{\rm AB} \mid S) = \frac{(N-1)!}{(n_{\rm O}-1)!\, n_{\rm AB}!}\, p_{\rm O}^{n_{\rm O}-1}\, p_{\rm AB}^{n_{\rm AB}}. \tag{3.29}
$$

The likelihood ratio is:

$$
\frac{P(n_{\rm O}, n_{\rm AB} \mid S)}{P(n_{\rm O}, n_{\rm AB} \mid \bar{S})} = \frac{n_{\rm O}/N}{p_{\rm O}}. \tag{3.30}
$$

This is an instructive result. The likelihood ratio, i.e. the contribution of
these data to the question of whether Oliver was present, depends simply on
a comparison of the frequency of his blood type in the observed data with the
background frequency in the population. There is no dependence on the counts
of the other types found at the scene, or their frequencies in the population.
If there are more type O stains than the average number expected under
hypothesis S̄, then the data give evidence in favour of the presence of Oliver.
Conversely, if there are fewer type O stains than the expected number under
S̄, then the data reduce the probability of the hypothesis that he was there.
In the special case nO/N = pO, the data contribute no evidence either way,
regardless of the fact that the data are compatible with the hypothesis S.
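The cancellation that leads from (3.28) and (3.29) to (3.30) can be checked with exact arithmetic. In this sketch the function names are illustrative, and the suspect is assumed to be of type O with at least one type O stain found (nO ≥ 1).

```python
from fractions import Fraction
from math import factorial

def prob_counts(n_o, n_ab, p_o, p_ab):
    """Probability that n_o + n_ab unknown people show these type counts,
    in the form used by equation (3.28)."""
    n = n_o + n_ab
    ways = Fraction(factorial(n), factorial(n_o) * factorial(n_ab))
    return ways * p_o ** n_o * p_ab ** n_ab

def likelihood_ratio(n_o, n_ab, p_o, p_ab):
    """P(data | S) / P(data | S-bar) for a type O suspect; requires n_o >= 1."""
    p_data_present = prob_counts(n_o - 1, n_ab, p_o, p_ab)  # equation (3.29)
    p_data_absent = prob_counts(n_o, n_ab, p_o, p_ab)       # equation (3.28)
    return p_data_present / p_data_absent

p_o, p_ab = Fraction(3, 5), Fraction(1, 100)
print(likelihood_ratio(1, 1, p_o, p_ab))  # the two-stain case of equation (3.26)
print(likelihood_ratio(6, 4, p_o, p_ab))  # observed fraction equals p_o
```

The first call returns 5/6, in agreement with (3.26); the second returns exactly 1, the special case nO/N = pO in which the data carry no evidence either way. In both cases the value of pAB has no effect on the result.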


> 3.5 Exercises 


> Exercise 3.8.[2, p.60] The three doors, normal rules.

On a game show, a contestant is told the rules as follows:

There are three doors, labelled 1, 2, 3. A single prize has 
been hidden behind one of them. You get to select one door. 
Initially your chosen door will not be opened. Instead, the 
gameshow host will open one of the other two doors, and he 
will do so in such a way as not to reveal the prize. For example, 
if you first choose door 1, he will then open one of doors 2 and 
3, and it is guaranteed that he will choose which one to open 
so that the prize will not be revealed. 

At this point, you will be given a fresh choice of door: you 
can either stick with your first choice, or you can switch to the 
other closed door. All the doors will then be opened and you 
will receive whatever is behind your final choice of door. 


Imagine that the contestant chooses door 1 first; then the gameshow host 
opens door 3, revealing nothing behind the door, as promised. Should 
the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does 
it make no difference? 


> Exercise 3.9.[2, p.61] The three doors, earthquake scenario.


Imagine that the game happens again and just as the gameshow host is 
about to open one of the doors a violent earthquake rattles the building 
and one of the three doors flies open. It happens to be door 3, and it 
happens not to have the prize behind it. The contestant had initially 
chosen door 1. 


Repositioning his toupée, the host suggests, ‘OK, since you chose door 
1 initially, door 3 is a valid door for me to open, according to the rules 
of the game; I’ll let door 3 stay open. Let’s carry on as if nothing 
happened.’ 


Should the contestant stick with door 1, or switch to door 2, or does it 
make no difference? Assume that the prize was placed randomly, that 
the gameshow host does not know where it is, and that the door flew 
open because its latch was broken by the earthquake. 


[A similar alternative scenario is a gameshow whose confused host for- 
gets the rules, and where the prize is, and opens one of the unchosen 
doors at random. He opens door 3, and the prize is not revealed. Should 
the contestant choose what’s behind door 1 or door 2? Does the opti- 
mal decision for the contestant depend on the contestant’s beliefs about 
whether the gameshow host is confused or not?] 

> Exercise 3.10.[2] Another example in which the emphasis is not on priors. You
visit a family whose three children are all at the local school. You don’t
know anything about the sexes of the children. While walking clumsily
round the home, you stumble through one of the three unlabelled
bedroom doors that you know belong, one each, to the three children, 
and find that the bedroom contains girlie stuff in sufficient quantities to 
convince you that the child who lives in that bedroom is a girl. Later, 
you sneak a look at a letter addressed to the parents, which reads ‘From 
the Headmaster: we are sending this letter to all parents who have male 
children at the school to inform them about the following boyish mat- 
ters...’. 


These two sources of evidence establish that at least one of the three 
children is a girl, and that at least one of the children is a boy. What 
are the probabilities that there are (a) two girls and one boy; (b) two 
boys and one girl? 


> Exercise 3.11.[2, p.61] Mrs S is found stabbed in her family garden. Mr S
behaves strangely after her death and is considered as a suspect. On
investigation of police and social records it is found that Mr S had beaten
up his wife on at least nine previous occasions. The prosecution advances
this data as evidence in favour of the hypothesis that Mr S is guilty of the
murder. ‘Ah no,’ says Mr S’s highly paid lawyer, ‘statistically, only one
in a thousand wife-beaters actually goes on to murder his wife.¹ So the
wife-beating is not strong evidence at all. In fact, given the wife-beating
evidence alone, it’s extremely unlikely that he would be the murderer of
his wife: only a 1/1000 chance. You should therefore find him innocent.’

Is the lawyer right to imply that the history of wife-beating does not
point to Mr S’s being the murderer? Or is the lawyer a slimy trickster?
If the latter, what is wrong with his argument?

[Having received an indignant letter from a lawyer about the preceding
paragraph, I’d like to add an extra inference exercise at this point: Does
my suggestion that Mr. S.’s lawyer may have been a slimy trickster imply
that I believe all lawyers are slimy tricksters? (Answer: No.)]

¹ In the U.S.A., it is estimated that 2 million women are abused each year by their partners.
In 1994, 4739 women were victims of homicide; of those, 1326 women (28%) were slain by
husbands and boyfriends.
(Sources: http://www.umn.edu/mincava/papers/factoid.htm,
http://www.gunfree.inter.net/vpc/womenfs.htm)

> Exercise 3.12.[2] A bag contains one counter, known to be either white or
black. A white counter is put in, the bag is shaken, and a counter
is drawn out, which proves to be white. What is now the chance of
drawing a white counter? [Notice that the state of the bag, after the
operations, is exactly identical to its state before.]

> Exercise 3.13.[2, p.62] You move into a new house; the phone is connected, and
you’re pretty sure that the phone number is 740511, but not as sure as
you would like to be. As an experiment, you pick up the phone and
dial 740511; you obtain a ‘busy’ signal. Are you now more sure of your
phone number? If so, how much?

> Exercise 3.14.[1] In a game, two coins are tossed. If either of the coins comes
up heads, you have won a prize. To claim the prize, you must point to
one of your coins that is a head and say ‘look, that coin’s a head, I’ve
won’. You watch Fred play the game. He tosses the two coins, and he
points to a coin and says ‘look, that coin’s a head, I’ve won’. What is
the probability that the other coin is a head?

> Exercise 3.15.[2, p.63] A statistical statement appeared in The Guardian on
Friday January 4, 2002:

When spun on edge 250 times, a Belgian one-euro coin came
up heads 140 times and tails 110. ‘It looks very suspicious
to me’, said Barry Blight, a statistics lecturer at the London
School of Economics. ‘If the coin were unbiased the chance of
getting a result as extreme as that would be less than 7%’.

But do these data give evidence that the coin is biased rather than fair?
[Hint: see equation (3.22).]


> 3.6 Solutions 


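Solution to exercise 3.1 (p.47). Let the data be D. Assuming equal prior
probabilities,

$$
\frac{P(A \mid D)}{P(B \mid D)} = \frac{1}{2}\,\frac{3}{2}\,\frac{1}{1}\,\frac{3}{2}\,\frac{1}{2}\,\frac{2}{2}\,\frac{1}{2} = \frac{9}{32} \tag{3.31}
$$

and P(A | D) = 9/41.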


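Solution to exercise 3.2 (p.47). The probability of the data given each
hypothesis is:

\begin{align}
P(D \mid A) &= \frac{3}{20}\,\frac{1}{20}\,\frac{2}{20}\,\frac{1}{20}\,\frac{3}{20}\,\frac{1}{20}\,\frac{1}{20} = \frac{18}{20^7}; \tag{3.32} \\
P(D \mid B) &= \frac{2}{20}\,\frac{2}{20}\,\frac{2}{20}\,\frac{2}{20}\,\frac{2}{20}\,\frac{1}{20}\,\frac{2}{20} = \frac{64}{20^7}; \tag{3.33} \\
P(D \mid C) &= \frac{1}{20}\,\frac{1}{20}\,\frac{1}{20}\,\frac{1}{20}\,\frac{1}{20}\,\frac{1}{20}\,\frac{1}{20} = \frac{1}{20^7}. \tag{3.34}
\end{align}

So

$$
P(A \mid D) = \frac{18}{18 + 64 + 1} = \frac{18}{83}; \qquad
P(B \mid D) = \frac{64}{83}; \qquad
P(C \mid D) = \frac{1}{83}. \tag{3.35}
$$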

[Figure 3.7. Posterior probability for the bias pa of a bent coin given two
different data sets. (a) P(pa | s = aba, F = 3) ∝ pa²(1 − pa);
(b) P(pa | s = bbb, F = 3) ∝ (1 − pa)³. In both panels the horizontal axis
is pa, from 0 to 1.]


Solution to exercise 3.5 (p.52). 


(a) P(pa | s = aba, F = 3) ∝ pa²(1 − pa). The most probable value of pa (i.e.,
the value that maximizes the posterior probability density) is 2/3. The
mean value of pa is 3/5.

See figure 3.7a.

(b) P(pa | s = bbb, F = 3) ∝ (1 − pa)³. The most probable value of pa (i.e.,
the value that maximizes the posterior probability density) is 0. The
mean value of pa is 1/5.

See figure 3.7b.








Solution to exercise 3.7 (p.54). The curves in figure 3.8 were found by finding
the mean and standard deviation of Fa, then setting Fa to the mean ± two
standard deviations to get a 95% plausible range for Fa, and computing the
three corresponding values of the log evidence ratio.

[Figure 3.8. Range of plausible values of the log evidence in favour of H1 as
a function of F, in three panels: H0 is true, pa = 1/6; H1 is true, pa = 0.25;
H1 is true, pa = 0.5. The vertical axis on the left is
log P(s | F, H1)/P(s | F, H0); the right-hand vertical axis shows the values of
P(s | F, H1)/P(s | F, H0). The solid line shows the log evidence if the random
variable Fa takes on its mean value, Fa = pa F. The dotted lines show
(approximately) the log evidence if Fa is at its 2.5th or 97.5th percentile.
(See also figure 3.6, p.54.)]
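The recipe in this solution can be sketched in a few lines of Python. Two details are assumptions of this sketch rather than statements from the text: factorials are evaluated at non-integer Fa through the log-gamma function, and the values mean ± 2 standard deviations are clipped to the range [0, F] so that small F does not produce negative counts. The function names are illustrative.

```python
import math

P0 = 1.0 / 6.0  # value of pa under model H0

def log_evidence_ratio(f_a, f):
    """ln [P(s | F, H1) / P(s | F, H0)] for a (possibly non-integer) count f_a."""
    f_b = f - f_a
    log_ev_h1 = math.lgamma(f_a + 1) + math.lgamma(f_b + 1) - math.lgamma(f + 2)
    log_ev_h0 = f_a * math.log(P0) + f_b * math.log(1.0 - P0)
    return log_ev_h1 - log_ev_h0

def plausible_range(f, p_a):
    """Log evidence ratio at Fa = mean - 2 sd, mean, mean + 2 sd (binomial Fa)."""
    mean = p_a * f
    sd = math.sqrt(f * p_a * (1.0 - p_a))
    low = max(0.0, mean - 2.0 * sd)         # clipped: an assumption of this sketch
    high = min(float(f), mean + 2.0 * sd)
    return tuple(log_evidence_ratio(x, f) for x in (low, mean, high))

for p_a in (1.0 / 6.0, 0.25, 0.5):
    print(p_a, plausible_range(200, p_a))
```

At F = 200 the central value is negative when pa = 1/6 (the data are expected to favour H0) and all three values are large and positive when pa = 0.5.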





Solution to exercise 3.8 (p.57). Let Hi denote the hypothesis that the prize is
behind door i. We make the following assumptions: the three hypotheses H1,
H2 and H3 are equiprobable a priori, i.e.,

$$
P(\mathcal{H}_1) = P(\mathcal{H}_2) = P(\mathcal{H}_3) = \frac{1}{3}. \tag{3.36}
$$

The datum we receive, after choosing door 1, is one of D=3 and D=2 (meaning
door 3 or 2 is opened, respectively). We assume that these two possible
outcomes have the following probabilities. If the prize is behind door 1 then
the host has a free choice; in this case we assume that the host selects at
random between D=2 and D=3. Otherwise the choice of the host is forced
and the probabilities are 0 and 1.

$$
\begin{array}{lll}
P(D=2 \mid \mathcal{H}_1) = 1/2 & P(D=2 \mid \mathcal{H}_2) = 0 & P(D=2 \mid \mathcal{H}_3) = 1 \\
P(D=3 \mid \mathcal{H}_1) = 1/2 & P(D=3 \mid \mathcal{H}_2) = 1 & P(D=3 \mid \mathcal{H}_3) = 0
\end{array} \tag{3.37}
$$

Now, using Bayes’ theorem, we evaluate the posterior probabilities of the
hypotheses:

$$
P(\mathcal{H}_i \mid D=3) = \frac{P(D=3 \mid \mathcal{H}_i)\, P(\mathcal{H}_i)}{P(D=3)} \tag{3.38}
$$

$$
P(\mathcal{H}_1 \mid D=3) = \frac{(1/2)(1/3)}{P(D=3)}, \quad
P(\mathcal{H}_2 \mid D=3) = \frac{(1)(1/3)}{P(D=3)}, \quad
P(\mathcal{H}_3 \mid D=3) = \frac{(0)(1/3)}{P(D=3)}. \tag{3.39}
$$

The denominator P(D=3) is (1/2) because it is the normalizing constant for
this posterior distribution. So

$$
P(\mathcal{H}_1 \mid D=3) = 1/3, \quad
P(\mathcal{H}_2 \mid D=3) = 2/3, \quad
P(\mathcal{H}_3 \mid D=3) = 0. \tag{3.40}
$$

So the contestant should switch to door 2 in order to have the biggest chance
of getting the prize.

Many people find this outcome surprising. There are two ways to make it
more intuitive. One is to play the game thirty times with a friend and keep
track of the frequency with which switching gets the prize. Alternatively, you
can perform a thought experiment in which the game is played with a million
doors. The rules are now that the contestant chooses one door, then the game
show host opens 999,998 doors in such a way as not to reveal the prize, leaving
the contestant’s selected door and one other door closed. The contestant may
now stick or switch. Imagine the contestant confronted by a million doors,
of which doors 1 and 234,598 have not been opened, door 1 having been the
contestant’s initial guess. Where do you think the prize is?
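Both the Bayesian calculation and the suggestion to play the game repeatedly can be tried in code. The sketch below computes the posterior exactly under the assumptions of the solution (equiprobable prize locations, host choosing at random when he has a free choice) and then simulates repeated games in which the contestant always switches. The function names, seed and number of games are illustrative.

```python
import random
from fractions import Fraction

def posterior_normal_rules(opened=3, chosen=1):
    """Posterior over the prize location after the host opens door `opened`."""
    prior = {h: Fraction(1, 3) for h in (1, 2, 3)}

    def likelihood(h):
        if h == opened:
            return Fraction(0)     # the host never reveals the prize
        if h == chosen:
            return Fraction(1, 2)  # free choice between the two other doors
        return Fraction(1)         # forced choice

    joint = {h: likelihood(h) * prior[h] for h in prior}
    p_data = sum(joint.values())   # the normalizing constant P(D)
    return {h: joint[h] / p_data for h in joint}

def switching_win_rate(n_games=30, seed=0):
    """Fraction of simulated games won by a contestant who always switches."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_games):
        prize = rng.randint(1, 3)
        chosen = 1
        # The host opens an unchosen door that does not hide the prize.
        opened = rng.choice([d for d in (2, 3) if d != prize])
        switched = ({1, 2, 3} - {chosen, opened}).pop()
        wins += (switched == prize)
    return wins / n_games

print(posterior_normal_rules())
print(switching_win_rate(n_games=100_000))
```

The exact posterior is (1/3, 2/3, 0), and over many simulated games the win rate for switching settles near 2/3; with only thirty games the estimate is naturally noisier.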


Solution to exercise 3.9 (p.57). If door 3 is opened by an earthquake, the 
inference comes out differently — even though visually the scene looks the 
same. The nature of the data, and the probability of the data, are both 
now different. The possible data outcomes are, firstly, that any number of 
the doors might have opened. We could label the eight possible outcomes 
d = (0,0,0), (0,0, 1), (0,1, 0), (1,0,0), (0,1,1),...,(1,1,1). Secondly, it might 
be that the prize is visible after the earthquake has opened one or more doors. 
So the data D consists of the value of d, and a statement of whether the prize 
was revealed. It is hard to say what the probabilities of these outcomes are, 
since they depend on our beliefs about the reliability of the door latches and 
the properties of earthquakes, but it is possible to extract the desired posterior 
probability without naming the values of P(d | H;) for each d. All that matters 
are the relative values of the quantities P(D|H,), P(D|H2), P(D | H3), for 
the value of D that actually occurred. [This is the likelihood principle, which 
we met in section 2.3.] The value of D that actually occurred is ‘d= (0,0, 1), 
and no prize visible’. First, it is clear that P(D|H3) = 0, since the datum 


























that no prize is visible is incompatible with H3. Now, assuming that the Where the prize is 
contestant selected door 1, how does the probability P(D | H1) compare with door door- doot 
P(D | H2)? Assuming that earthquakes are not sensitive to decisions of game 1 2 3 
show contestants, these two quantities have to be equal, by symmetry. We none | 22ers Pnone 
don’t know how likely it is that door 3 falls off its hinges, but however likely 3 3 3 
it is, it’s just as likely to do so whether the prize is behind door 1 or door 2. J i iol 
So, if P(D|#1) and P(D |H2) are equal, we obtain: E 
2 
P(H,|D) Pn Peers) P(H2|D) = Perey) P(H3|D) = Peay) 5 2 
= 1⁄2 = 1⁄2 =0. È B p3 
(3.41) g 3 3 
The two possible hypotheses are now equally likely. $ 
If we assume that the host knows where the prize is and might be acting F aa 
deceptively, then the answer might be further modified, because we have to 8 
view the host’s words as part of the data. Z bs 
Confused? It’s well worth making sure you understand these two gameshow = 
problems. Don’t worry, I slipped up on the second problem, the first time I S23 
met it. P1,2,3 P1,2,3 
There is a general rule which helps immensely when you have a confusing
probability problem:

    Always write down the probability of everything.
                                            (Steve Gull)

From this joint probability, any desired inference can be mechanically
obtained (figure 3.9).

Figure 3.9. The probability of everything, for the second three-door problem,
assuming an earthquake has just occurred. Here, p_3 is the probability that
door 3 alone is opened by an earthquake. [Table not reproduced: its columns
give where the prize is (door 1, door 2, door 3), its rows give which doors
are opened (none, ..., 1,2,3), and its entries have the form p_none/3, p_3/3,
p_{1,2,3}/3.]
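The advice to write down the probability of everything can be followed quite literally in a few lines of code. The sketch below is only an illustration under stated assumptions: the function name and the numerical value of p_3 are made up for the example, and the contestant is assumed to have chosen door 1. It builds the joint probability of each hypothesis with the observed datum and normalizes.

```python
from fractions import Fraction


def posterior_after_earthquake(p3):
    """Posterior over prize locations given D = 'door 3 alone opened, no prize visible'.

    p3 is the (unknown) probability that the earthquake opens door 3 alone;
    any value strictly between 0 and 1 may be supplied.
    """
    prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
    # The earthquake is assumed not to care where the prize is, so door 3
    # opens with probability p3 under every hypothesis. 'No prize visible'
    # is impossible if the prize is behind door 3.
    likelihood = {1: p3, 2: p3, 3: Fraction(0)}
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Hypothetical value of p3, chosen only for illustration.
post = posterior_after_earthquake(Fraction(1, 10))
for door, prob in post.items():
    print(f"P(H_{door} | D) = {prob}")
```

Changing p3 rescales the joint probabilities but leaves the posterior untouched, which is the likelihood principle at work.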





Solution to exercise 3.11 (p.58). The statistic quoted by the lawyer indicates 
the probability that a randomly selected wife-beater will also murder his wife. 
The probability that the husband was the murderer, given that the wife has 
been murdered, is a completely different quantity. 


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.




To deduce the latter, we need to make further assumptions about the 
probability that the wife is murdered by someone else. If she lives in a neigh- 
bourhood with frequent random murders, then this probability is large and 
the posterior probability that the husband did it (in the absence of other ev- 
idence) may not be very large. But in more peaceful regions, it may well be 
that the most likely person to have murdered you, if you are found murdered, 
is one of your closest relatives. 

Let’s work out some illustrative numbers with the help of the statistics 
on page 58. Let m=1 denote the proposition that a woman has been mur- 
dered; h=1, the proposition that the husband did it; and b=1, the propo- 
sition that he beat her in the year preceding the murder. The statement 
‘someone else did it’ is denoted by h=0. We need to define P(h|m=1), 
P(b|h=1,m=1), and P(b=1|h=0,m=1) in order to compute the pos- 
terior probability P(h=1|b=1,m=1). From the statistics, we can read 
out P(h=1|m=1) = 0.28. And if two million women out of 100 million 
are beaten, then P(b=1|h=0,m=1) = 0.02. Finally, we need a value for 
P(b|h=1,m=1): if a man murders his wife, how likely is it that this is the 
first time he laid a finger on her? I expect it’s pretty unlikely; so maybe 
P(b=1|h=1,m=1) is 0.9 or larger. 

By Bayes’ theorem, then, 








    P(h=1 | b=1, m=1) = \frac{0.9 \times 0.28}{0.9 \times 0.28 + 0.02 \times 0.72} \simeq 95\%.    (3.42)
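The arithmetic in equation (3.42) is easy to reproduce and to vary. In this small sketch the function name and argument names are hypothetical; the default numbers are the ones used in the text.

```python
def posterior_husband(p_h=0.28, p_b_given_h=0.9, p_b_given_not_h=0.02):
    """P(h=1 | b=1, m=1) by Bayes' theorem.

    p_h             : P(h=1 | m=1), prior that the husband is the murderer
    p_b_given_h     : P(b=1 | h=1, m=1)
    p_b_given_not_h : P(b=1 | h=0, m=1)
    """
    numerator = p_b_given_h * p_h
    denominator = numerator + p_b_given_not_h * (1.0 - p_h)
    return numerator / denominator


post_h = posterior_husband()
print(f"P(h=1 | b=1, m=1) = {post_h:.3f}")  # roughly 0.95
```

Trying other values of p_b_given_h (say 0.5 instead of 0.9) shows how little the conclusion depends on that guess.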











One way to make obvious the sliminess of the lawyer on p.58 is to construct 
arguments, with the same logical structure as his, that are clearly wrong. 
For example, the lawyer could say ‘Not only was Mrs. S murdered, she was 
murdered between 4.02pm and 4.03pm. Statistically, only one in a million 
wife-beaters actually goes on to murder his wife between 4.02pm and 4.03pm. 
So the wife-beating is not strong evidence at all. In fact, given the wife-beating 
evidence alone, it’s extremely unlikely that he would murder his wife in this 
way — only a 1/1,000,000 chance.’ 


Solution to exercise 3.13 (p.58). There are two hypotheses. H_0: your number
is 740511; H_1: it is another number. The data, D, are ‘when I dialled 740511,
I got a busy signal’. What is the probability of D, given each hypothesis? If
your number is 740511, then we expect a busy signal with certainty:


    P(D | H_0) = 1.


On the other hand, if H_1 is true, then the probability that the number dialled
returns a busy signal is smaller than 1, since various other outcomes were also
possible (a ringing tone, or a number-unobtainable signal, for example). The
value of this probability P(D | H_1) will depend on the probability α that a
random phone number similar to your own phone number would be a valid
phone number, and on the probability β that you get a busy signal when you
dial a valid phone number.

I estimate from the size of my phone book that Cambridge has about
75000 valid phone numbers, all of length six digits. The probability that a
random six-digit number is valid is therefore about 75000/10^6 = 0.075. If
we exclude numbers beginning with 0, 1, and 9 from the random choice, the
probability α is about 75000/700000 ≃ 0.1. If we assume that telephone
numbers are clustered then a misremembered number might be more likely
to be valid than a randomly chosen number; so the probability, α, that our
guessed number would be valid, assuming H_1 is true, might be bigger than
0.1. Anyway, α must be somewhere between 0.1 and 1. We can carry forward
this uncertainty in the probability and see how much it matters at the end.

The probability β that you get a busy signal when you dial a valid phone
number is equal to the fraction of phones you think are in use or off-the-hook
when you make your tentative call. This fraction varies from town to town
and with the time of day. In Cambridge, during the day, I would guess that
about 1% of phones are in use. At 4am, maybe 0.1%, or fewer.

The probability P(D | H_1) is the product of α and β, that is, about 0.1 ×
0.01 = 10^{-3}. According to our estimates, there’s about a one-in-a-thousand
chance of getting a busy signal when you dial a random number; or
one-in-a-hundred, if valid numbers are strongly clustered; or one-in-10^4, if
you dial in the wee hours.

How do the data affect your beliefs about your phone number? The pos- 
terior probability ratio is the likelihood ratio times the prior probability ratio: 

    \frac{P(H_0 | D)}{P(H_1 | D)} = \frac{P(D | H_0)}{P(D | H_1)} \, \frac{P(H_0)}{P(H_1)}.    (3.43)





The likelihood ratio is about 100-to-1 or 1000-to-1, so the posterior probability
ratio is swung by a factor of 100 or 1000 in favour of H_0. If the prior probability
of H_0 was 0.5 then the posterior probability is


    P(H_0 | D) = \frac{1}{1 + \frac{P(D | H_1)}{P(D | H_0)}} \simeq 0.99 or 0.999.    (3.44)
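To see how much the uncertainty in α and β matters at the end, one can tabulate the posterior for a few combinations. This is a minimal sketch; the function name is hypothetical, and the (α, β) pairs are simply the rough estimates discussed above.

```python
def posterior_h0(alpha, beta, prior_h0=0.5):
    """Posterior probability that 740511 is your number, given a busy signal.

    alpha : probability that a misremembered number is a valid number
    beta  : probability that a valid number is busy when dialled
    """
    p_d_given_h0 = 1.0           # your own phone is certainly busy
    p_d_given_h1 = alpha * beta  # some other number happens to be busy
    prior_odds = prior_h0 / (1.0 - prior_h0)
    posterior_odds = (p_d_given_h0 / p_d_given_h1) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


for alpha, beta in [(0.1, 0.01), (1.0, 0.01), (0.1, 0.001)]:
    print(f"alpha = {alpha}, beta = {beta}: P(H0 | D) = {posterior_h0(alpha, beta):.4f}")
```

With a prior of 0.5 the posterior stays at or above about 0.99 across these estimates, so the uncertainty carried forward turns out not to matter much.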


Solution to exercise 3.15 (p.59). We compare the models H_0 (the coin is fair)
and H_1 (the coin is biased, with the prior on its bias set to the uniform
distribution P(p | H_1) = 1). [The use of a uniform prior seems reasonable
to me, since I know that some coins, such as American pennies, have severe
biases when spun on edge; so the situations p = 0.01 or p = 0.1 or p = 0.95
would not surprise me.]

When I mention H_0, the coin is fair, a pedant would say, ‘how absurd to even
consider that the coin is fair; any coin is surely biased to some extent’. And
of course I would agree. So will pedants kindly understand H_0 as meaning ‘the
coin is fair to within one part in a thousand, i.e., p ∈ 0.5 ± 0.001’.

The likelihood ratio is:

    \frac{P(D | H_1)}{P(D | H_0)} = \frac{140!\,110!/251!}{1/2^{250}} = 0.48.    (3.45)

Thus the data give scarcely any evidence either way; in fact they give weak
evidence (two to one) in favour of H_0!

Figure 3.10. The probability distribution of the number of heads given the
two hypotheses, that the coin is fair, and that it is biased, with the prior
distribution of the bias being uniform. The outcome (D = 140 heads) gives
weak evidence in favour of H_0, the hypothesis that the coin is fair.

‘No, no’, objects the believer in bias, ‘your silly uniform prior doesn’t
represent my prior beliefs about the bias of biased coins; I was expecting only
a small bias’. To be as generous as possible to H_1, let’s see how well it
could fare if the prior were presciently set. Let us allow a prior of the form

    P(p | H_1, \alpha) = \frac{1}{Z(\alpha)} p^{\alpha-1} (1-p)^{\alpha-1}, where Z(\alpha) = \Gamma(\alpha)^2 / \Gamma(2\alpha)    (3.46)

(a Beta distribution, with the original uniform prior reproduced by setting
α = 1). By tweaking α, the likelihood ratio for H_1 over H_0,

    \frac{P(D | H_1, \alpha)}{P(D | H_0)} = \frac{\Gamma(140+\alpha)\,\Gamma(110+\alpha)\,\Gamma(2\alpha)\,2^{250}}{\Gamma(250+2\alpha)\,\Gamma(\alpha)^2},    (3.47)

can be increased a little. It is shown for several values of α in figure 3.11.
Even the most favourable choice of α (α ≃ 50) can yield a likelihood ratio of
only two to one in favour of H_1.
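Equation (3.47) involves ratios of enormous Gamma functions, so it is most safely evaluated in the log domain. The sketch below is one way to do that with the standard library; the function name is hypothetical, and the α values are the ones listed in figure 3.11.

```python
from math import exp, lgamma, log


def likelihood_ratio(heads, tails, alpha):
    """P(D | H1, alpha) / P(D | H0) for a symmetric Beta(alpha, alpha) prior on the bias.

    Evaluated with log-Gamma functions to avoid overflow.
    """
    n = heads + tails
    log_ratio = (lgamma(heads + alpha) + lgamma(tails + alpha) + lgamma(2 * alpha)
                 + n * log(2.0)
                 - lgamma(n + 2 * alpha) - 2 * lgamma(alpha))
    return exp(log_ratio)


for alpha in (0.37, 1.0, 2.7, 7.4, 20, 55, 148, 403, 1096):
    ratio = likelihood_ratio(140, 110, alpha)
    print(f"alpha = {alpha:7.2f}   P(D|H1,alpha)/P(D|H0) = {ratio:.2f}")
```

Setting alpha = 1 recovers the uniform-prior value of equation (3.45); calling the same function with 141 heads and 109 tails gives the modified data set discussed later in this solution.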

In conclusion, the data are not ‘very suspicious’. They can be construed 
as giving at most two-to-one evidence in favour of one or other of the two 
hypotheses. 


Are these wimpy likelihood ratios the fault of over-restrictive priors? Is there
any way of producing a ‘very suspicious’ conclusion? The prior that is best-
matched to the data, in terms of likelihood, is the prior that sets p to f ≡
140/250 with probability one. Let’s call this model H_*. The likelihood ratio is
P(D | H_*)/P(D | H_0) = 2^{250} f^{140} (1 − f)^{110} = 6.1. So the strongest evidence that
these data can possibly muster against the hypothesis that there is no bias is
six-to-one.
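The six-to-one bound is a one-line calculation, again done in logs to avoid multiplying 2^250 by a minuscule number. Variable names here are illustrative only.

```python
from math import exp, log

heads, tails = 140, 110
n = heads + tails
f = heads / n  # the bias that maximizes the likelihood of the data

# log of 2^n * f^heads * (1 - f)^tails
log_best_ratio = n * log(2.0) + heads * log(f) + tails * log(1.0 - f)
best_ratio = exp(log_best_ratio)
print(f"Largest possible likelihood ratio against H0: {best_ratio:.1f}")
```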


While we are noticing the absurdly misleading answers that ‘sampling the- 
ory’ statistics produces, such as the p-value of 7% in the exercise we just solved, 
let’s stick the boot in. If we make a tiny change to the data set, increasing 
the number of heads in 250 tosses from 140 to 141, we find that the p-value 
goes below the mystical value of 0.05 (the p-value is 0.0497). The sampling 
theory statistician would happily squeak ‘the probability of getting a result as 
extreme as 141 heads is smaller than 0.05 — we thus reject the null hypothesis 
at a significance level of 5%’. The correct answer is shown for several values
of α in figure 3.12. The values worth highlighting from this table are, first,
the likelihood ratio when H_1 uses the standard uniform prior, which is 1:0.61
in favour of the null hypothesis H_0. Second, the most favourable choice of α,
from the point of view of H_1, can only yield a likelihood ratio of about 2.3:1
in favour of H_1.

Be warned! A p-value of 0.05 is often interpreted as implying that the odds 
are stacked about twenty-to-one against the null hypothesis. But the truth 
in this case is that the evidence either slightly favours the null hypothesis, or 
disfavours it by at most 2.3 to one, depending on the choice of prior. 
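For readers who want to see where such p-values come from, here is a rough check. It assumes that the quoted p-values are two-sided exact binomial tail probabilities under the fair-coin hypothesis; that definition is an assumption of this sketch, since the text does not spell out how the numbers were obtained, and the function name is hypothetical.

```python
from math import comb


def two_sided_p_value(heads, n=250):
    """P(|X - n/2| >= |heads - n/2|) when X ~ Binomial(n, 1/2)."""
    deviation = abs(heads - n / 2)
    # Count all outcomes at least as far from n/2 as the observed one.
    count = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= deviation)
    return count / 2 ** n


p140 = two_sided_p_value(140)
p141 = two_sided_p_value(141)
print(f"p-value for 140 heads in 250 tosses: {p140:.4f}")
print(f"p-value for 141 heads in 250 tosses: {p141:.4f}")
```

One extra head moves the p-value across the conventional threshold, while the likelihood ratios in figures 3.11 and 3.12 barely change.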

The p-values and ‘significance levels’ of classical statistics should be treated 
with extreme caution. Shun them! Here ends the sermon. 
    α        P(D | H_1, α) / P(D | H_0)
    0.37     0.25
    1.0      0.48
    2.7      0.82
    7.4      1.3
    20       1.8
    55       1.9
    148      1.7
    403      1.3
    1096     1.1


Figure 3.11. Likelihood ratio for
various choices of the prior
distribution’s hyperparameter α.


    α        P(D′ | H_1, α) / P(D′ | H_0)
    0.37     0.32
    1.0      0.61
    2.7      1.0
    7.4      1.6
    20       2.2
    55       2.3
    148      1.9
    403      1.4
    1096     1.2


Figure 3.12. Likelihood ratio for
various choices of the prior
distribution’s hyperparameter α,
when the data are D′ = 141 heads
in 250 trials.
Part I


Data Compression


About Chapter 4


In this chapter we discuss how to measure the information content of the 
outcome of a random experiment. 

This chapter has some tough bits. If you find the mathematical details 
hard, skim through them and keep going — you'll be able to enjoy Chapters 5 
and 6 without this chapter’s tools. 

Before reading Chapter 4, you should have read Chapter 2 and worked on 
exercises 2.21-2.25 and 2.16 (pp.36-37), and exercise 4.1 below. 

The following exercise is intended to help you think about how to measure 
information content. 


Exercise 4.1.[2, p.69] Please work on this problem before reading Chapter 4.


You are given 12 balls, all equal in weight except for one that is either 
heavier or lighter. You are also given a two-pan balance to use. In each 
use of the balance you may put any number of the 12 balls on the left 
pan, and the same number on the right pan, and push a button to initiate 
the weighing; there are three possible outcomes: either the weights are 
equal, or the balls on the left are heavier, or the balls on the left are 
lighter. Your task is to design a strategy to determine which is the odd 
ball and whether it is heavier or lighter than the others in as few uses 
of the balance as possible. 


While thinking about this problem, you may find it helpful to consider 
the following questions: 


(a) How can one measure information? 


(b) When you have identified the odd ball and whether it is heavy or 
light, how much information have you gained? 


(c) Once you have designed a strategy, draw a tree showing, for each 
of the possible outcomes of a weighing, what weighing you perform 
next. At each node in the tree, how much information have the 
outcomes so far given you, and how much information remains to 
be gained? 

(d) How much information is gained when you learn (i) the state of a 
flipped coin; (ii) the states of two flipped coins; (iii) the outcome 
when a four-sided die is rolled? 


(e) How much information is gained on the first step of the weighing 
problem if 6 balls are weighed against the other 6? How much is 
gained if 4 are weighed against 4 on the first step, leaving out 4 
balls? 
Notation

    x ∈ A        x is a member of the set A
    S ⊂ A        S is a subset of the set A
    S ⊆ A        S is a subset of, or equal to, the set A
    V = B ∪ A    V is the union of the sets B and A
    V = B ∩ A    V is the intersection of the sets B and A
    |A|          number of elements in set A
The Source Coding Theorem


4.1 How to measure the information content of a random variable?


In the next few chapters, we’ll be talking about probability distributions and
random variables. Most of the time we can get by with sloppy notation,
but occasionally, we will need precise notation. Here is the notation that we
established in Chapter 2.

An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value
of a random variable, which takes on one of a set of possible values,
A_X = {a_1, a_2, ..., a_i, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I},
with P(x=a_i) = p_i, p_i ≥ 0 and \sum_{a_i ∈ A_X} P(x=a_i) = 1.


How can we measure the information content of an outcome x = a_i from such
an ensemble? In this chapter we examine the assertions


1. that the Shannon information content,

       h(x = a_i) ≡ \log_2 \frac{1}{p_i},    (4.1)

   is a sensible measure of the information content of the outcome x = a_i,
   and


2. that the entropy of the ensemble,

       H(X) = \sum_i p_i \log_2 \frac{1}{p_i},    (4.2)

   is a sensible measure of the ensemble’s average information content.

    p        h(p)     H_2(p)
    0.001    10.0     0.011
    0.01     6.6      0.081
    0.1      3.3      0.47
    0.2      2.3      0.72
    0.5      1.0      1.0

Figure 4.1. The Shannon information content h(p) = log_2 (1/p) and the binary
entropy function H_2(p) = H(p, 1−p) = p log_2 (1/p) + (1 − p) log_2 (1/(1 − p))
as a function of p. [Plots of h(p) and H_2(p) against p not reproduced.]


Figure 4.1 shows the Shannon information content of an outcome with
probability p, as a function of p. The less probable an outcome is, the greater
its Shannon information content. Figure 4.1 also shows the binary entropy
function,


    H_2(p) = H(p, 1−p) = p \log_2 \frac{1}{p} + (1 − p) \log_2 \frac{1}{1 − p},    (4.3)


which is the entropy of the ensemble X whose alphabet and probability
distribution are A_X = {a, b}, P_X = {p, (1 − p)}.
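Both functions are one-liners, and evaluating them at a few values of p is a quick way to get a feel for the numbers in figure 4.1. In this sketch the names `h` and `H2` mirror the text's notation; the convention 0 log(1/0) = 0 used for the endpoints is the usual one.

```python
from math import log2


def h(p):
    """Shannon information content, in bits, of an outcome with probability p > 0."""
    return log2(1.0 / p)


def H2(p):
    """Binary entropy function, in bits, using the convention 0 * log(1/0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1.0 / p) + (1.0 - p) * log2(1.0 / (1.0 - p))


for p in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(f"p = {p:<6}  h(p) = {h(p):5.1f}  H2(p) = {H2(p):.3f}")
```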
Information content of independent random variables


Why should log 1/p_i have anything to do with the information content? Why
not some other function of p_i? We’ll explore this question in detail shortly,
but first, notice a nice property of this particular function h(x) = log 1/p(x).

Imagine learning the value of two independent random variables, x and y. 
The definition of independence is that the probability distribution is separable 
into a product: 


P(x,y) = P(x) P(y). (4.4) 


Intuitively, we might want any measure of the ‘amount of information gained’ 
to have the property of additivity — that is, for independent random variables 
x and y, the information gained when we learn x and y should equal the sum 
of the information gained if x alone were learned and the information gained 
if y alone were learned. 

The Shannon information content of the outcome x, y is


    h(x, y) = \log \frac{1}{P(x, y)} = \log \frac{1}{P(x) P(y)} = \log \frac{1}{P(x)} + \log \frac{1}{P(y)},    (4.5)

so it does indeed satisfy

    h(x, y) = h(x) + h(y), if x and y are independent.    (4.6)

Exercise 4.2.[1, p.86] Show that, if x and y are independent, the entropy of the
outcome x, y satisfies


    H(X, Y) = H(X) + H(Y).    (4.7)

In words, entropy is additive for independent variables.
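A numerical check is no substitute for the proof requested in the exercise, but it is reassuring to see additivity hold for a concrete case. The two distributions below are arbitrary choices made for illustration.

```python
from itertools import product
from math import log2


def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return sum(p * log2(1.0 / p) for p in probs if p > 0)


# Two arbitrary, independent ensembles (illustrative values only).
p_x = [0.5, 0.25, 0.25]
p_y = [0.9, 0.1]

# Independence: the joint distribution is the product of the marginals.
p_xy = [px * py for px, py in product(p_x, p_y)]

print(f"H(X) + H(Y) = {entropy(p_x) + entropy(p_y):.6f} bits")
print(f"H(X, Y)     = {entropy(p_xy):.6f} bits")
```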


We now explore these ideas with some examples; then, in section 4.4 and 
in Chapters 5 and 6, we prove that the Shannon information content and the 
entropy are related to the number of bits needed to describe the outcome of 
an experiment. 


The weighing problem: designing informative experiments 


Have you solved the weighing problem (exercise 4.1, p.66) yet? Are you sure?
Notice that in three uses of the balance (which reads either ‘left heavier’,
‘right heavier’, or ‘balanced’) the number of conceivable outcomes is 3^3 = 27,
whereas the number of possible states of the world is 24: the odd ball could
be any of twelve balls, and it could be heavy or light. So in principle, the
problem might be solvable in three weighings, but not in two, since 3^2 < 24.

If you know how you can determine the odd weight and whether it is 
heavy or light in three weighings, then you may read on. If you haven’t found 
a strategy that always gets there in three weighings, I encourage you to think 
about exercise 4.1 some more. 

Why is your strategy optimal? What is it about your series of weighings 
that allows useful information to be gained as quickly as possible? The answer 
is that at each step of an optimal procedure, the three outcomes (‘left heavier’, 
‘right heavier’, and ‘balance’) are as close as possible to equiprobable. An 
optimal solution is shown in figure 4.2. 

Suboptimal strategies, such as weighing balls 1-6 against 7-12 on the first
step, do not achieve all outcomes with equal probability: these two sets of balls
can never balance, so the only possible outcomes are ‘left heavy’ and ‘right
heavy’. Such a binary outcome rules out only half of the possible hypotheses,
so a strategy that uses such outcomes must sometimes take longer to find the
right answer.

Figure 4.2. An optimal solution to the weighing problem. At each step there are two boxes: the left
box shows which hypotheses are still possible; the right box shows the balls involved in the
next weighing. The 24 hypotheses are written 1+, ..., 12−, with, e.g., 1+ denoting that
1 is the odd ball and it is heavy. Weighings are written by listing the names of the balls
on the two pans, separated by a line; for example, in the first weighing, balls 1, 2, 3, and
4 are put on the left-hand side and 5, 6, 7, and 8 on the right. In each triplet of arrows
the upper arrow leads to the situation when the left side is heavier, the middle arrow to
the situation when the right side is heavier, and the lower arrow to the situation when the
outcome is balanced. The three points labelled ⋆ correspond to impossible outcomes.
[Tree diagram not reproduced.]

The insight that the outcomes should be as near as possible to equiprobable 
makes it easier to search for an optimal strategy. The first weighing must 
divide the 24 possible hypotheses into three groups of eight. Then the second 
weighing must be chosen so that there is a 3:3:2 split of the hypotheses. 

Thus we might conclude: 


    the outcome of a random experiment is guaranteed to be most
    informative if the probability distribution over outcomes is uniform.

This conclusion agrees with the property of the entropy that you proved
when you solved exercise 2.25 (p.37): the entropy of an ensemble X is biggest
if all the outcomes have equal probability p_i = 1/|A_X|.
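The comparison between weighings can be made quantitative by computing the entropy of each weighing's outcome. The sketch below assumes all remaining hypotheses are equally probable, so a weighing is summarized by how many hypotheses lead to ‘left heavier’, ‘right heavier’ and ‘balanced’; the function name is hypothetical.

```python
from math import log2


def outcome_entropy(split):
    """Entropy in bits of a weighing whose outcomes partition equally
    probable hypotheses into groups of the given sizes."""
    total = sum(split)
    return sum((count / total) * log2(total / count) for count in split if count > 0)


first_4v4 = outcome_entropy([8, 8, 8])    # balls 1-4 against 5-8: 24 hypotheses split 8:8:8
first_6v6 = outcome_entropy([12, 12, 0])  # balls 1-6 against 7-12: the pans can never balance
second = outcome_entropy([3, 3, 2])       # best available split of 8 remaining hypotheses

print(f"4 against 4: {first_4v4:.3f} bits")
print(f"6 against 6: {first_6v6:.3f} bits")
print(f"3:3:2 split: {second:.3f} bits")
```

The uniform three-way split yields log2 3 bits, the most a three-outcome experiment can deliver, whereas the six-against-six weighing can never yield more than one bit.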


Guessing games 


In the game of twenty questions, one player thinks of an object, and the 
other player attempts to guess what the object is by asking questions that 
have yes/no answers, for example, ‘is it alive?’, or ‘is it human?’ The aim 
is to identify the object with as few questions as possible. What is the best 
strategy for playing this game? For simplicity, imagine that we are playing 
the rather dull version of twenty questions called ‘sixty-three’. 


Example 4.3. The game ‘sixty-three’. What’s the smallest number of yes/no 
questions needed to identify an integer x between 0 and 63? 


Intuitively, the best questions successively divide the 64 possibilities into equal 
sized sets. Six questions suffice. One reasonable strategy asks the following 
questions: 


    1: is x ≥ 32?
    2: is x mod 32 ≥ 16?
    3: is x mod 16 ≥ 8?
    4: is x mod 8 ≥ 4?
    5: is x mod 4 ≥ 2?
    6: is x mod 2 = 1?


[The notation x mod 32, pronounced ‘x modulo 32’, denotes the remainder
when x is divided by 32; for example, 35 mod 32 = 3 and 32 mod 32 = 0.]

The answers to these questions, if translated from {yes, no} to {1, 0}, give
the binary expansion of x, for example 35 = 100011.
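The six questions can be transcribed directly into code, which makes the link to the binary expansion explicit. The function name below is hypothetical.

```python
def sixty_three_answers(x):
    """Answers (1 = yes, 0 = no) to the six questions for an integer 0 <= x <= 63."""
    if not 0 <= x <= 63:
        raise ValueError("x must lie between 0 and 63")
    # Question k asks whether x mod m is at least m/2, for m = 64, 32, ..., 2.
    # For m = 64 this is 'is x >= 32?'; for m = 2 it is 'is x mod 2 = 1?'.
    return [1 if x % m >= m // 2 else 0 for m in (64, 32, 16, 8, 4, 2)]


answers = sixty_three_answers(35)
print(answers)  # [1, 0, 0, 0, 1, 1]

# Reading the answers as a binary number recovers x.
decoded = int("".join(str(bit) for bit in answers), 2)
print(decoded)  # 35
```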





What are the Shannon information contents of the outcomes in this
example? If we assume that all values of x are equally likely, then the answers
to the questions are independent and each has Shannon information content
log_2 (1/0.5) = 1 bit; the total Shannon information gained is always six bits.
Furthermore, the number x that we learn from these questions is a six-bit
binary number. Our questioning strategy defines a way of encoding the random
variable x as a binary file.

So far, the Shannon information content makes sense: it measures the 
length of a binary file that encodes x. However, we have not yet studied 
ensembles where the outcomes have unequal probabilities. Does the Shannon 
information content make sense there too? 


[Table accompanying figure 4.3; the board pictures and the squares fired at are not reproduced.]

    move #         1         2         32        48        49
    outcome        x = n     x = n     x = n     x = n     x = y
    P(x)           63/64     62/63     32/33     16/17     1/16
    h(x)           0.0227    0.0230    0.0443    0.0874    4.0
    Total info.    0.0227    0.0458    1.0       2.0       6.0


The game of submarine: how many bits can one bit convey? 


In the game of battleships, each player hides a fleet of ships in a sea represented 
by a square grid. On each turn, one player attempts to hit the other’s ships by 
firing at one square in the opponent’s sea. The response to a selected square 
such as ‘G3’ is either ‘miss’, ‘hit’, or ‘hit and destroyed’. 

In a boring version of battleships called submarine, each player hides just 
one submarine in one square of an eight-by-eight grid. Figure 4.3 shows a few 
pictures of this game in progress: the circle represents the square that is being 
fired at, and the xs show squares in which the outcome was a miss, x = n; the 
submarine is hit (outcome x = y shown by the symbol s) on the 49th attempt. 

Each shot made by a player defines an ensemble. The two possible out- 
comes are {y,n}, corresponding to a hit and a miss, and their probabili- 
ties depend on the state of the board. At the beginning, P(y) = 1/64 and 
P(n) = 63/64. At the second shot, if the first shot missed, P(y) = 1/63 and 
P(n) = 62/63. At the third shot, if the first two shots missed, P(y) = 1/62 
and P(n) = 61/62. 

The Shannon information gained from an outcome x is h(x) = log(1/P(x)).
If we are lucky, and hit the submarine on the first shot, then


    h(x) = h_{(1)}(y) = \log_2 64 = 6 \text{ bits}.    (4.8)


Now, it might seem a little strange that one binary outcome can convey six 
bits. But we have learnt the hiding place, which could have been any of 64 
squares; so we have, by one lucky binary question, indeed learnt six bits. 

What if the first shot misses? The Shannon information that we gain from
this outcome is


    h(x) = h_{(1)}(n) = \log_2 \frac{64}{63} = 0.0227 \text{ bits}.    (4.9)


Does this make sense? It is not so obvious. Let’s keep going. If our second
shot also misses, the Shannon information content of the second outcome is


    h_{(2)}(n) = \log_2 \frac{63}{62} = 0.0230 \text{ bits}.    (4.10)

If we miss thirty-two times (firing at a new square each time), the total
Shannon information gained is


    \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots + \log_2 \frac{33}{32}
        = 0.0227 + 0.0230 + \cdots + 0.0443 = 1.0 \text{ bits}.    (4.11)





Figure 4.3. A game of submarine.
The submarine is hit on the 49th
attempt.
Why this round number? Well, what have we learnt? We now know that the
submarine is not in any of the 32 squares we fired at; learning that fact is just 
like playing a game of sixty-three (p.70), asking as our first question ‘is x 
one of the thirty-two numbers corresponding to these squares I fired at?’, and 
receiving the answer ‘no’. This answer rules out half of the hypotheses, so it 
gives us one bit. 

After 48 unsuccessful shots, the information gained is 2 bits: the unknown 
location has been narrowed down to one quarter of the original hypothesis 
space. 

What if we hit the submarine on the 49th shot, when there were 16 squares 
left? The Shannon information content of this outcome is 


    h_{(49)}(y) = \log_2 16 = 4.0 \text{ bits}.    (4.12)


The total Shannon information content of all the outcomes is


    \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots + \log_2 \frac{17}{16} + \log_2 \frac{16}{1}
        = 0.0227 + 0.0230 + \cdots + 0.0874 + 4.0 = 6.0 \text{ bits}.    (4.13)





So once we know where the submarine is, the total Shannon information con- 
tent gained is 6 bits. 

This result holds regardless of when we hit the submarine. If we hit it
when there are n squares left to choose from (n was 16 in equation (4.13)),
then the total information gained is:

    \log_2 \frac{64}{63} + \log_2 \frac{63}{62} + \cdots + \log_2 \frac{n+1}{n} + \log_2 \frac{n}{1}
        = \log_2 \left[ \frac{64}{63} \times \frac{63}{62} \times \cdots \times \frac{n+1}{n} \times \frac{n}{1} \right]
        = \log_2 \frac{64}{1} = 6 \text{ bits}.    (4.14)
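The telescoping sum can be checked by brute force for every possible moment of the hit. The function names below are hypothetical; the only assumption is the one made in the text, that each shot is fired at a new square and the submarine is equally likely to be in any untried square.

```python
from math import log2


def information_from_misses(misses, squares=64):
    """Total Shannon information (bits) from a run of consecutive misses."""
    # Before miss number k + 1 there are (squares - k) candidate squares,
    # so P(n) = (squares - k - 1) / (squares - k).
    return sum(log2((squares - k) / (squares - k - 1)) for k in range(misses))


def total_information(hit_on_shot, squares=64):
    """Total information when the submarine is hit on the given shot."""
    misses = hit_on_shot - 1
    remaining = squares - misses  # squares left when the hit occurs
    return information_from_misses(misses, squares) + log2(remaining)


print(f"After 32 misses: {information_from_misses(32):.4f} bits")
print(f"After 48 misses: {information_from_misses(48):.4f} bits")
print(f"Hit on the 49th shot: {total_information(49):.4f} bits in total")
```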
What have we learned from the examples so far? I think the submarine
example makes quite a convincing case for the claim that the Shannon
information content is a sensible measure of information content. And the game of
sixty-three shows that the Shannon information content can be intimately
connected to the size of a file that encodes the outcomes of a random
experiment, thus suggesting a possible connection to data compression.

In case you’re not convinced, let’s look at one more example.


The Wenglish language


Wenglish is a language similar to English. Wenglish sentences consist of words
drawn at random from the Wenglish dictionary, which contains 2^15 = 32,768
words, all of length 5 characters. Each word in the Wenglish dictionary was
constructed at random by picking five letters from the probability distribution
over a...z depicted in figure 2.1.

Figure 4.4. The Wenglish dictionary. [Sample entries, by position:
1 aaail, 2 aaaiu, ..., 129 abati, ..., 2047 azpan, 2048 aztdn, ...,
16384 odrer, ..., 32768 zxast.]

Some entries from the dictionary are shown in alphabetical order in
figure 4.4. Notice that the number of words in the dictionary (32,768) is
much smaller than the total number of possible words of length 5 letters,
26^5 ≃ 12,000,000.

Because the probability of the letter z is about 1/1000, only 32 of the 
words in the dictionary begin with the letter z. In contrast, the probability 
of the letter a is about 0.0625, and 2048 of the words begin with the letter a. 
Of those 2048 words, two start az, and 128 start aa. 

Let’s imagine that we are reading a Wenglish document, and let’s discuss
the Shannon information content of the characters as we acquire them. If we
are given the text one word at a time, the Shannon information content of
each five-character word is log 32,768 = 15 bits, since Wenglish uses all its
words with equal probability. The average information content per character
is therefore 3 bits.

Now let’s look at the information content if we read the document one
character at a time. If, say, the first letter of a word is a, the Shannon
information content is log 1/0.0625 ≃ 4 bits. If the first letter is z, the Shannon
information content is log 1/0.001 ≃ 10 bits. The information content is thus
highly variable at the first character. The total information content of the 5
characters in a word, however, is exactly 15 bits; so the letters that follow an
initial z have lower average information content per character than the letters
that follow an initial a. A rare initial letter such as z indeed conveys more
information about what the word is than a common initial letter.
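The bookkeeping in this paragraph is easy to spell out. The sketch uses the round letter probabilities quoted in the text (0.0625 for a, 0.001 for z); the variable names are illustrative.

```python
from math import log2

word_bits = log2(32768)         # every word equally probable: 15 bits per word
per_char = word_bits / 5        # average over the five characters: 3 bits

first_a = log2(1 / 0.0625)      # information in an initial 'a': 4 bits
first_z = log2(1 / 0.001)       # information in an initial 'z': about 10 bits

# Whatever the first letter conveyed, the word as a whole carries 15 bits,
# so the remaining four letters share what is left.
rest_after_a = (word_bits - first_a) / 4
rest_after_z = (word_bits - first_z) / 4

print(f"per word: {word_bits:.0f} bits, per character: {per_char:.0f} bits")
print(f"after initial a: {rest_after_a:.2f} bits per remaining character")
print(f"after initial z: {rest_after_z:.2f} bits per remaining character")
```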

Similarly, in English, if rare characters occur at the start of the word (e.g. 
xyl...), then often we can identify the whole word immediately; whereas 
words that start with common characters (e.g. pro...) require more charac- 
ters before we can identify them. 


4.2 Data compression


The preceding examples justify the idea that the Shannon information content 
of an outcome is a natural measure of its information content. Improbable out- 
comes do convey more information than probable outcomes. We now discuss 
the information content of a source by considering how many bits are needed 
to describe the outcome of an experiment. 

If we can show that we can compress data from a particular source into 
a file of L bits per source symbol and recover the data reliably, then we will 
say that the average information content of that source is at most L bits per 
symbol. 


Example: compression of text files


A file is composed of a sequence of bytes. A byte is composed of 8 bits and
can have a decimal value between 0 and 255. A typical text file is composed
of the ASCII character set (decimal values 0 to 127). This character set uses
only seven of the eight bits in a byte.

[Margin note: Here we use the word ‘bit’ with its meaning, ‘a symbol with two
values’, not to be confused with the unit of information content.]

Exercise 4.4.[1, p.86] By how much could the size of a file be reduced given
that it is an ASCII file? How would you achieve this reduction?


Intuitively, it seems reasonable to assert that an ASCII file contains 7/8 as 
much information as an arbitrary file of the same size, since we already know 
one out of every eight bits before we even look at the file. This is a simple ex- 
ample of redundancy. Most sources of data have further redundancy: English 
text files use the ASCII characters with non-equal frequency; certain pairs of 
letters are more probable than others; and entire words can be predicted given 
the context and a semantic understanding of the text. 


Some simple data compression methods that define measures of informa- 
tion content 


One way of measuring the information content of a random variable is simply
to count the number of possible outcomes, |A_X|. (The number of elements in
a set A is denoted by |A|.) If we gave a binary name to each outcome, the
length of each name would be log_2 |A_X| bits, if |A_X| happened to be a power
of 2. We thus make the following definition.


The raw bit content of X is


    H_0(X) = \log_2 |A_X|.    (4.15)


H_0(X) is a lower bound for the number of binary questions that are always
guaranteed to identify an outcome from the ensemble X. It is an additive
quantity: the raw bit content of an ordered pair x, y, having |A_X||A_Y| possible
outcomes, satisfies


    H_0(X, Y) = H_0(X) + H_0(Y).    (4.16)


This measure of information content does not include any probabilistic 
element, and the encoding rule it corresponds to does not ‘compress’ the source 
data, it simply maps each outcome to a constant-length binary string. 


Exercise 4.5.[1, p.86] Could there be a compressor that maps an outcome x to
a binary code c(x), and a decompressor that maps c back to x, such
that every possible outcome is compressed into a binary code of length
shorter than H_0(X) bits?


Even though a simple counting argument shows that it is impossible to make 
a reversible compression program that reduces the size of all files, ama- 
teur compression enthusiasts frequently announce that they have invented 
a program that can do this — indeed that they can further compress com- 
pressed files by putting them through their compressor several times. Stranger 
yet, patents have been granted to these modern-day alchemists. See the 
comp.compression frequently asked questions for further reading.¹
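The counting argument mentioned above can be made concrete with a short sketch. The function names below are illustrative choices, not anything defined in the text; the point is only that there are fewer strictly shorter binary files than there are files of a given length, so no reversible map can shorten them all.

```python
def count_files_of_length(n_bits: int) -> int:
    """Number of distinct binary files that are exactly n_bits long."""
    return 2 ** n_bits


def count_strictly_shorter_files(n_bits: int) -> int:
    """Number of distinct binary files of length 0, 1, ..., n_bits - 1."""
    return sum(2 ** length for length in range(n_bits))


n = 16
inputs = count_files_of_length(n)
shorter_outputs = count_strictly_shorter_files(n)
# A reversible compressor needs one distinct output per input.
print(inputs, shorter_outputs, shorter_outputs >= inputs)
```

Whatever the value of `n`, the second count falls one short of the first, so at least two inputs would have to share an output.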

There are only two ways in which a ‘compressor’ can actually compress files:


1. A lossy compressor compresses some files, but maps some files to the 
same encoding. We’ll assume that the user requires perfect recovery of 
the source file, so the occurrence of one of these confusable files leads 
to a failure (though in applications such as image compression, lossy 
compression is viewed as satisfactory). We’ll denote by $\delta$ the probability
that the source string is one of the confusable files, so a lossy compressor
has a probability $\delta$ of failure. If $\delta$ can be made very small then a lossy
compressor may be practically useful. 


2. A lossless compressor maps all files to different encodings; if it shortens 
some files, it necessarily makes others longer. We try to design the 
compressor so that the probability that a file is lengthened is very small, 
and the probability that it is shortened is large. 


In this chapter we discuss a simple lossy compressor. In subsequent chapters 
we discuss lossless compression methods. 


> 4.3 Information content defined in terms of lossy compression 


Whichever type of compressor we construct, we need somehow to take into
account the probabilities of the different outcomes. Imagine comparing the
information contents of two text files: one in which all 128 ASCII characters
are used with equal probability, and one in which the characters are used with
their frequencies in English text. Can we define a measure of information
content that distinguishes between these two files? Intuitively, the latter file
contains less information per character because it is more predictable.

¹ http://sunsite.org.uk/public/usenet/news-faqs/comp.compression/

One simple way to use our knowledge that some symbols have a smaller
probability is to imagine recoding the observations into a smaller alphabet
(thus losing the ability to encode some of the more improbable symbols)
and then measuring the raw bit content of the new alphabet. For example,
we might take a risk when compressing English text, guessing that the most
infrequent characters won’t occur, and make a reduced ASCII code that omits
the characters { !, @, #, %, ^, *, ~, <, >, /, \, _, {, }, [, ], | }, thereby reducing
the size of the alphabet by seventeen. The larger the risk we are willing to
take, the smaller our final alphabet becomes.

We introduce a parameter $\delta$ that describes the risk we are taking when
using this compression method: $\delta$ is the probability that there will be no
name for an outcome $x$.

Example 4.6. Let

$$\mathcal{A}_X = \{\, a, b, c, d, e, f, g, h \,\},$$
$$\text{and } \mathcal{P}_X = \{\, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{1}{4}, \tfrac{3}{16}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64}, \tfrac{1}{64} \,\}. \tag{4.17}$$

The raw bit content of this ensemble is 3 bits, corresponding to 8 binary
names. But notice that $P(x \in \{a,b,c,d\}) = 15/16$. So if we are willing
to run a risk of $\delta = 1/16$ of not having a name for $x$, then we can get
by with four names, half as many names as are needed if every $x \in \mathcal{A}_X$
has a name.

Table 4.5 shows binary names that could be given to the different outcomes
in the cases $\delta = 0$ and $\delta = 1/16$. When $\delta = 0$ we need 3 bits to
encode the outcome; when $\delta = 1/16$ we need only 2 bits.

Table 4.5. Binary names for the outcomes, for two failure probabilities $\delta$.

      δ = 0            δ = 1/16
    x    c(x)        x    c(x)
    a    000         a    00
    b    001         b    01
    c    010         c    10
    d    011         d    11
    e    100         e    -
    f    101         f    -
    g    110         g    -
    h    111         h    -


Let us now formalize this idea. To make a compression strategy with risk
$\delta$, we make the smallest possible subset $S_\delta$ such that the probability that $x$ is
not in $S_\delta$ is less than or equal to $\delta$, i.e., $P(x \notin S_\delta) \le \delta$. For each value of $\delta$
we can then define a new measure of information content: the log of the size
of this smallest subset $S_\delta$. [In ensembles in which several elements have the
same probability, there may be several smallest subsets that contain different
elements, but all that matters is their sizes (which are equal), so we will not
dwell on this ambiguity.]


The smallest $\delta$-sufficient subset $S_\delta$ is the smallest subset of $\mathcal{A}_X$ satisfying

$$P(x \in S_\delta) \ge 1 - \delta. \tag{4.18}$$


The subset $S_\delta$ can be constructed by ranking the elements of $\mathcal{A}_X$ in order of
decreasing probability and adding successive elements starting from the most
probable elements until the total probability is $\ge (1 - \delta)$.

We can make a data compression code by assigning a binary name to each 
element of the smallest sufficient subset. This compression scheme motivates 
the following measure of information content: 


The essential bit content of $X$ is:

$$H_\delta(X) = \log_2 |S_\delta|. \tag{4.19}$$


Note that $H_0(X)$ is the special case of $H_\delta(X)$ with $\delta = 0$ (if $P(x) > 0$ for all
$x \in \mathcal{A}_X$). [Caution: do not confuse $H_0(X)$ and $H_\delta(X)$ with the function $H_2(p)$
displayed in figure 4.1.]
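The construction of the smallest sufficient subset translates directly into a few lines of code. What follows is a minimal sketch under stated assumptions: the function names are hypothetical, exact fractions are used so that the comparison with $1 - \delta$ is not disturbed by rounding, and $0 \le \delta < 1$ is assumed so that the subset is never empty.

```python
from fractions import Fraction
from math import log2


def smallest_sufficient_subset_size(probs, delta):
    """Size of the smallest delta-sufficient subset: rank the outcomes by
    decreasing probability and add them until the total is >= 1 - delta."""
    target = 1 - Fraction(delta)
    total = Fraction(0)
    size = 0
    for p in sorted(probs, reverse=True):
        if total >= target:
            break
        total += p
        size += 1
    return size


def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|, as in equation (4.19); assumes delta < 1."""
    return log2(smallest_sufficient_subset_size(probs, delta))


# The ensemble of example 4.6.
P_X = [Fraction(1, 4)] * 3 + [Fraction(3, 16)] + [Fraction(1, 64)] * 4

print(essential_bit_content(P_X, 0))
print(essential_bit_content(P_X, Fraction(1, 16)))
```

For the ensemble of example 4.6 the two calls reproduce the 3 bits and 2 bits quoted with table 4.5.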

Figure 4.6 shows $H_\delta(X)$ for the ensemble of example 4.6 as a function of
$\delta$.

Figure 4.6. (a) The outcomes of $X$ (from example 4.6 (p.75)), ranked
by their probability. (b) The ...


Extended ensembles


Is this compression method any more useful if we compress blocks of symbols 
from a source? 

We now turn to examples where the outcome $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ is a
string of $N$ independent identically distributed random variables from a single
ensemble $X$. We will denote by $X^N$ the ensemble $(X_1, X_2, \ldots, X_N)$. Remember
that entropy is additive for independent variables (exercise 4.2 (p.68)), so
$H(X^N) = N H(X)$.


Example 4.7. Consider a string of $N$ flips of a bent coin, $\mathbf{x} = (x_1, x_2, \ldots, x_N)$,
where $x_n \in \{0,1\}$, with probabilities $p_0 = 0.9$, $p_1 = 0.1$. The most probable
strings $\mathbf{x}$ are those with most 0s. If $r(\mathbf{x})$ is the number of 1s in $\mathbf{x}$
then

$$P(\mathbf{x}) = p_0^{N - r(\mathbf{x})} \, p_1^{r(\mathbf{x})}. \tag{4.20}$$

To evaluate $H_\delta(X^N)$ we must find the smallest sufficient subset $S_\delta$. This
subset will contain all $\mathbf{x}$ with $r(\mathbf{x}) = 0, 1, 2, \ldots,$ up to some $r_{\max}(\delta) - 1$,
and some of the $\mathbf{x}$ with $r(\mathbf{x}) = r_{\max}(\delta)$. Figures 4.7 and 4.8 show graphs
of $H_\delta(X^N)$ against $\delta$ for the cases $N = 4$ and $N = 10$. The steps are the
values of $\delta$ at which $|S_\delta|$ changes by 1, and the cusps where the slope of
the staircase changes are the points where $r_{\max}$ changes by 1.
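For the bent coin the subset $S_\delta$ need not be enumerated string by string, because all strings with the same number of 1s share one probability. The sketch below exploits this; the function names are hypothetical, and it assumes $p_1 < 1/2$ (so that probability decreases with $r$), $0 \le \delta < 1$, and that $p_1$ is supplied as an exact fraction.

```python
from fractions import Fraction
from math import comb, log2


def coin_subset_size(N, p1, delta):
    """|S_delta| for N i.i.d. flips with P(1) = p1 < 1/2.

    Strings are added in order of increasing number of 1s, r; the C(N, r)
    strings with the same r all have probability p0^(N-r) p1^r.
    """
    p1 = Fraction(p1)
    p0 = 1 - p1
    target = 1 - Fraction(delta)
    total, size = Fraction(0), 0
    for r in range(N + 1):
        if total >= target:
            break
        p_string = p0 ** (N - r) * p1 ** r
        # ceil((target - total) / p_string): strings of this r still needed.
        needed = -((total - target) // p_string)
        take = min(comb(N, r), needed)
        total += take * p_string
        size += take
    return size


def per_symbol_bits(N, p1, delta):
    """(1/N) H_delta(X^N) for the bent coin."""
    return log2(coin_subset_size(N, p1, delta)) / N


p1 = Fraction(1, 10)
for N in (4, 10):
    row = [round(per_symbol_bits(N, p1, Fraction(d, 100)), 3) for d in (1, 10, 50)]
    print(N, row)
```

Calling `per_symbol_bits` with larger `N` and with `delta` near 0 and near 1 gives a rough numerical view of how strongly the result depends on $\delta$.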


Exercise 4.8.[2, p.86] What are the mathematical shapes of the curves between
the cusps?


For the examples shown in figures 4.6-4.8, $H_\delta(X^N)$ depends strongly on
the value of $\delta$, so it might not seem a fundamental or useful definition of
information content. But we will consider what happens as $N$, the number
of independent variables in $X^N$, increases. We will find the remarkable result
that $H_\delta(X^N)$ becomes almost independent of $\delta$, and for all $\delta$ it is very close
to $NH(X)$, where $H(X)$ is the entropy of one of the random variables.

Figure 4.9 illustrates this asymptotic tendency for the binary ensemble of
example 4.7. As $N$ increases, $\frac{1}{N} H_\delta(X^N)$ becomes an increasingly flat function,
except for tails close to $\delta = 0$ and 1. As long as we are allowed a tiny
probability of error $\delta$, compression down to $NH$ bits is possible. Even if we
are allowed a large probability of error, we still can compress only down to
$NH$ bits. This is the source coding theorem.


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


Figure 4.7. (a) The sixteen outcomes of the ensemble $X^4$ with
$p_1 = 0.1$, ranked by probability. (Horizontal axis: $\log_2 P(\mathbf{x})$.) ...


Theorem 4.1 Shannon's source coding theorem. Let $X$ be an ensemble with
entropy $H(X) = H$ bits. Given $\epsilon > 0$ and $0 < \delta < 1$, there exists a positive
integer $N_0$ such that for $N > N_0$,

$$\left| \frac{1}{N} H_\delta(X^N) - H \right| < \epsilon. \tag{4.21}$$


> 4.4 Typicality


Why does increasing $N$ help? Let’s examine long strings from $X^N$. Figure 4.10
shows fifteen samples from $X^N$ for $N = 100$ and $p_1 = 0.1$. The probability
of a string $\mathbf{x}$ that contains $r$ 1s and $N - r$ 0s is

$$P(\mathbf{x}) = p_1^r (1 - p_1)^{N-r}. \tag{4.22}$$

The number of strings that contain $r$ 1s is

$$n(r) = \binom{N}{r}. \tag{4.23}$$

So the number of 1s, $r$, has a binomial distribution:

$$P(r) = \binom{N}{r} p_1^r (1 - p_1)^{N-r}. \tag{4.24}$$

These functions are shown in figure 4.11. The mean of $r$ is $N p_1$, and its
standard deviation is $\sqrt{N p_1 (1 - p_1)}$ (p.1). If $N$ is 100 then

$$r \sim N p_1 \pm \sqrt{N p_1 (1 - p_1)} \simeq 10 \pm 3. \tag{4.25}$$

If $N = 1000$ then

$$r \sim 100 \pm 10. \tag{4.26}$$


Figure 4.10. The top 15 strings are samples from $X^{100}$, where
$p_1 = 0.1$ and $p_0 = 0.9$. The bottom two are the most and
least probable strings in this ensemble. The final column shows
the log-probabilities of the random strings, which may be
compared with the entropy $H(X^{100}) = 46.9$ bits.

Figure 4.11. Anatomy of the typical set $T$. For $p_1 = 0.1$ and $N = 100$ and $N = 1000$, these graphs
show $n(r)$, the number of strings containing $r$ 1s; the probability $P(\mathbf{x})$ of a single string
that contains $r$ 1s; the same probability on a log scale; and the total probability $n(r) P(\mathbf{x})$ of
all strings that contain $r$ 1s. The number $r$ is on the horizontal axis. The plot of $\log_2 P(\mathbf{x})$
also shows by a dotted line the mean value of $\log_2 P(\mathbf{x}) = -N H_2(p_1)$, which equals $-46.9$
when $N = 100$ and $-469$ when $N = 1000$. The typical set includes only the strings that
have $\log_2 P(\mathbf{x})$ close to this value. The range marked $T$ shows the set $T_{N\beta}$ (as defined in
section 4.4) for $N = 100$ and $\beta = 0.29$ (left) and $N = 1000$, $\beta = 0.09$ (right).


Notice that as $N$ gets bigger, the probability distribution of $r$ becomes more
concentrated, in the sense that while the range of possible values of $r$ grows
as $N$, the standard deviation of $r$ grows only as $\sqrt{N}$. That $r$ is most likely to
fall in a small range of values implies that the outcome $\mathbf{x}$ is also most likely to
fall in a corresponding small subset of outcomes that we will call the typical
set.
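The concentration of $r$ can be tabulated with a few lines of arithmetic. This is only an illustrative sketch: the helper name is hypothetical, and the value $N = 10000$ is added here to extend the trend beyond the two cases quoted in the text.

```python
from math import sqrt


def ones_count_summary(N, p1):
    """Mean and standard deviation of the number of 1s in N i.i.d. flips."""
    mean = N * p1
    std = sqrt(N * p1 * (1 - p1))
    return mean, std


for N in (100, 1000, 10000):
    mean, std = ones_count_summary(N, 0.1)
    # std / N is the width of the distribution relative to the range of r.
    print(f"N={N}: r ~ {mean:.0f} +/- {std:.1f}, std/N = {std / N:.4f}")
```

The relative width `std / N` shrinks as $1/\sqrt{N}$, which is the sense in which the distribution becomes more concentrated.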





Definition of the typical set 


Let us define typicality for an arbitrary ensemble $X$ with alphabet $\mathcal{A}_X$. Our
definition of a typical string will involve the string’s probability. A long string
of $N$ symbols will usually contain about $p_1 N$ occurrences of the first symbol,
$p_2 N$ occurrences of the second, etc. Hence the probability of this string is
roughly

$$P(\mathbf{x})_{\text{typ}} = P(x_1) P(x_2) P(x_3) \cdots P(x_N) \simeq p_1^{(p_1 N)} p_2^{(p_2 N)} \cdots p_I^{(p_I N)} \tag{4.27}$$

so that the information content of a typical string is

$$\log_2 \frac{1}{P(\mathbf{x})} \simeq N \sum_i p_i \log_2 \frac{1}{p_i} = N H. \tag{4.28}$$

So the random variable $\log_2 1/P(\mathbf{x})$, which is the information content of $\mathbf{x}$, is
very likely to be close in value to $NH$. We build our definition of typicality
on this observation.

We define the typical elements of $\mathcal{A}_X^N$ to be those elements that have probability
close to $2^{-NH}$. (Note that the typical set, unlike the smallest sufficient
subset, does not include the most probable elements of $\mathcal{A}_X^N$, but we will show
that these most probable elements contribute negligible probability.)

We introduce a parameter $\beta$ that defines how close the probability has to
be to $2^{-NH}$ for an element to be ‘typical’. We call the set of typical elements
the typical set, $T_{N\beta}$:

$$T_{N\beta} \equiv \left\{ \mathbf{x} \in \mathcal{A}_X^N : \left| \frac{1}{N} \log_2 \frac{1}{P(\mathbf{x})} - H \right| < \beta \right\}. \tag{4.29}$$

We will show that whatever value of $\beta$ we choose, the typical set contains
almost all the probability as $N$ increases.

This important result is sometimes called the ‘asymptotic equipartition’
principle.


‘Asymptotic equipartition’ principle. For an ensemble of $N$ independent
identically distributed (i.i.d.) random variables $X^N \equiv (X_1, X_2, \ldots, X_N)$,
with $N$ sufficiently large, the outcome $\mathbf{x} = (x_1, x_2, \ldots, x_N)$ is almost
certain to belong to a subset of $\mathcal{A}_X^N$ having only $2^{NH(X)}$ members, each
having probability ‘close to’ $2^{-NH(X)}$.


Notice that if $H(X) < H_0(X)$ then $2^{NH(X)}$ is a tiny fraction of the number
of possible outcomes $|\mathcal{A}_X^N| = |\mathcal{A}_X|^N = 2^{N H_0(X)}$.
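For the binary ensemble of example 4.7, both the probability mass of the typical set and its size can be computed exactly by grouping strings according to their number of 1s. The sketch below is one way to do this; the function name and return convention are assumptions made for illustration, and the values of $\beta$ are those quoted in the caption of figure 4.11.

```python
from math import comb, exp, lgamma, log, log2


def typical_set_summary(N, p1, beta):
    """Return (P(x in T), log2|T| / N, H) for N i.i.d. bits with P(1) = p1."""
    p0 = 1 - p1
    H = p1 * log2(1 / p1) + p0 * log2(1 / p0)
    mass, count = 0.0, 0
    for r in range(N + 1):
        # Information content per symbol of any string containing r ones.
        info = (r * log2(1 / p1) + (N - r) * log2(1 / p0)) / N
        if abs(info - H) < beta:
            # Binomial probability of r ones, evaluated in log space.
            log_pmf = (lgamma(N + 1) - lgamma(r + 1) - lgamma(N - r + 1)
                       + r * log(p1) + (N - r) * log(p0))
            mass += exp(log_pmf)
            count += comb(N, r)
    bits_per_symbol = log2(count) / N if count else float("-inf")
    return mass, bits_per_symbol, H


for N, beta in ((100, 0.29), (1000, 0.09)):
    mass, bits, H = typical_set_summary(N, 0.1, beta)
    print(f"N={N}, beta={beta}: P(T)={mass:.4f}, log2|T|/N={bits:.3f}, H={H:.3f}")
```

The second number returned can be compared with $H$ and with $H_0 = 1$ bit to see how small a fraction of all $2^N$ strings the typical set occupies.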


The term equipartition is chosen to describe the idea that the members of
the typical set have roughly equal probability. [This should not be taken too
literally, hence my use of quotes around ‘asymptotic equipartition’; see page
83.]

A second meaning for equipartition, in thermal physics, is the idea that each
degree of freedom of a classical system has equal average energy, $\frac{1}{2}kT$. This
second meaning is not intended here.

[Figure 4.12: strings ranked along an axis of $\log_2 P(\mathbf{x})$, with the value $-NH(X)$ marked.]


The ‘asymptotic equipartition’ principle is equivalent to: 


Shannon’s source coding theorem (verbal statement). $N$ i.i.d. random
variables each with entropy $H(X)$ can be compressed into more
than $NH(X)$ bits with negligible risk of information loss, as $N \to \infty$;
conversely if they are compressed into fewer than $NH(X)$ bits it is
virtually certain that information will be lost.


These two theorems are equivalent because we can define a compression
algorithm that gives a distinct name of length $NH(X)$ bits to each $\mathbf{x}$ in the typical
set.


> 4.5 Proofs 


This section may be skipped if found tough going. 


The law of large numbers 
Our proof of the source coding theorem uses the law of large numbers. 


Mean and variance of a real random variable are $\mathcal{E}[u] = \bar{u} = \sum_u P(u) u$
and $\mathrm{var}(u) = \sigma_u^2 = \mathcal{E}[(u - \bar{u})^2] = \sum_u P(u)(u - \bar{u})^2$.


Technical note: strictly I am assuming here that $u$ is a function $u(x)$
of a sample $x$ from a finite discrete ensemble $X$. Then the summations
$\sum_u P(u) f(u)$ should be written $\sum_x P(x) f(u(x))$. This means that $P(u)$
is a finite sum of delta functions. This restriction guarantees that the
mean and variance of $u$ do exist, which is not necessarily the case for
general $P(u)$.


Chebyshev’s inequality 1. Let $t$ be a non-negative real random variable,
and let $\alpha$ be a positive real number. Then

$$P(t \ge \alpha) \le \frac{\bar{t}}{\alpha}. \tag{4.30}$$

Proof: $P(t \ge \alpha) = \sum_{t \ge \alpha} P(t)$. We multiply each term by $t/\alpha \ge 1$ and
obtain: $P(t \ge \alpha) \le \sum_{t \ge \alpha} P(t) t/\alpha$. We add the (non-negative) missing
terms and obtain: $P(t \ge \alpha) \le \sum_t P(t) t/\alpha = \bar{t}/\alpha$. □





Figure 4.12. Schematic diagram showing all strings in the ensemble
$X^N$ ranked by their probability, and the typical set $T_{N\beta}$.


Chebyshev’s inequality 2. Let $x$ be a random variable, and let $\alpha$ be a
positive real number. Then

$$P\left( (x - \bar{x})^2 \ge \alpha \right) \le \sigma_x^2 / \alpha. \tag{4.31}$$

Proof: Take $t = (x - \bar{x})^2$ and apply the previous proposition. □

Weak law of large numbers. Take $x$ to be the average of $N$ independent
random variables $h_1, \ldots, h_N$, having common mean $\bar{h}$ and common
variance $\sigma_h^2$: $x = \frac{1}{N} \sum_{n=1}^{N} h_n$. Then

$$P\left( (x - \bar{h})^2 \ge \alpha \right) \le \sigma_h^2 / (\alpha N). \tag{4.32}$$

Proof: obtained by showing that $\bar{x} = \bar{h}$ and that $\sigma_x^2 = \sigma_h^2 / N$. □





We are interested in $x$ being very close to the mean ($\alpha$ very small). No matter
how large $\sigma_h^2$ is, and no matter how small the required $\alpha$ is, and no matter
how small the desired probability that $(x - \bar{h})^2 \ge \alpha$, we can always achieve it
by taking $N$ large enough.
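The weak law can be checked exactly for a simple case. The sketch below takes the $h_n$ to be the binary flips of example 4.7 themselves (an assumption made for illustration, so that $\bar{h} = p_1$ and $\sigma_h^2 = p_1(1 - p_1)$), and compares the exact probability of a large deviation with the bound (4.32). The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def deviation_probability(N, p1, alpha):
    """Exact P((x - h)^2 >= alpha), where x = r/N is the average of N i.i.d.
    binary variables with mean h = p1, and r is the number of 1s."""
    p1 = Fraction(p1)
    total = Fraction(0)
    for r in range(N + 1):
        if (Fraction(r, N) - p1) ** 2 >= alpha:
            total += comb(N, r) * p1 ** r * (1 - p1) ** (N - r)
    return total


def chebyshev_bound(N, p1, alpha):
    """Right-hand side of (4.32) with sigma_h^2 = p1 (1 - p1)."""
    p1 = Fraction(p1)
    return p1 * (1 - p1) / (Fraction(alpha) * N)


p1, alpha = Fraction(1, 10), Fraction(1, 100)
for N in (10, 100, 1000):
    exact = deviation_probability(N, p1, alpha)
    bound = chebyshev_bound(N, p1, alpha)
    print(N, float(exact), float(bound))
```

The bound is loose (for small $N$ it can exceed 1 for other choices of $\alpha$), but it falls as $1/N$, which is all the proof of the source coding theorem requires.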


Proof of theorem 4.1 (p.78) 


We apply the law of large numbers to the random variable $\frac{1}{N} \log_2 \frac{1}{P(\mathbf{x})}$ defined
for $\mathbf{x}$ drawn from the ensemble $X^N$. This random variable can be written as
the average of $N$ information contents $h_n = \log_2 (1/P(x_n))$, each of which is a
random variable with mean $H = H(X)$ and variance $\sigma^2 \equiv \mathrm{var}[\log_2 (1/P(x_n))]$.
(Each term $h_n$ is the Shannon information content of the $n$th outcome.)

We again define the typical set with parameters $N$ and $\beta$ thus:

$$T_{N\beta} = \left\{ \mathbf{x} \in \mathcal{A}_X^N : \left[ \frac{1}{N} \log_2 \frac{1}{P(\mathbf{x})} - H \right]^2 < \beta^2 \right\}. \tag{4.33}$$

For all $\mathbf{x} \in T_{N\beta}$, the probability of $\mathbf{x}$ satisfies

$$2^{-N(H+\beta)} < P(\mathbf{x}) < 2^{-N(H-\beta)}. \tag{4.34}$$

And by the law of large numbers,

$$P(\mathbf{x} \in T_{N\beta}) \ge 1 - \frac{\sigma^2}{\beta^2 N}. \tag{4.35}$$

We have thus proved the ‘asymptotic equipartition’ principle. As $N$ increases,
the probability that $\mathbf{x}$ falls in $T_{N\beta}$ approaches 1, for any $\beta$. How does this
result relate to source coding?

We must relate $T_{N\beta}$ to $H_\delta(X^N)$. We will show that for any given $\delta$ there
is a sufficiently big $N$ such that $H_\delta(X^N) \simeq NH$.

Part 1: $\frac{1}{N} H_\delta(X^N) < H + \epsilon$.

The set $T_{N\beta}$ is not the best subset for compression. So the size of $T_{N\beta}$ gives
an upper bound on $H_\delta$. We show how small $H_\delta(X^N)$ must be by calculating
how big $T_{N\beta}$ could possibly be. We are free to set $\beta$ to any convenient value.
The smallest possible probability that a member of $T_{N\beta}$ can have is $2^{-N(H+\beta)}$,
and the total probability contained by $T_{N\beta}$ can’t be any bigger than 1. So

$$|T_{N\beta}| \, 2^{-N(H+\beta)} < 1, \tag{4.36}$$

that is, the size of the typical set is bounded by

$$|T_{N\beta}| < 2^{N(H+\beta)}. \tag{4.37}$$

If we set $\beta = \epsilon$ and $N_0$ such that $\frac{\sigma^2}{\epsilon^2 N_0} \le \delta$, then $P(T_{N\beta}) \ge 1 - \delta$, and the set
$T_{N\beta}$ becomes a witness to the fact that $H_\delta(X^N) \le \log_2 |T_{N\beta}| < N(H + \epsilon)$.

Figure 4.13. Schematic illustration of the two parts of the theorem.
Given any $\delta$ and $\epsilon$, we show that for large enough $N$, $\frac{1}{N} H_\delta(X^N)$
lies (1) below the line $H + \epsilon$ and (2) above the line $H - \epsilon$.

Part 2: $\frac{1}{N} H_\delta(X^N) > H - \epsilon$.


Imagine that someone claims this second part is not so: that, for any $N$,
the smallest $\delta$-sufficient subset $S_\delta$ is smaller than the above inequality would
allow. We can make use of our typical set to show that they must be mistaken.
Remember that we are free to set $\beta$ to any value we choose. We will set
$\beta = \epsilon/2$, so that our task is to prove that a subset $S'$ having $|S'| \le 2^{N(H-2\beta)}$
and achieving $P(\mathbf{x} \in S') \ge 1 - \delta$ cannot exist (for $N$ greater than an $N_0$ that
we will specify).

So, let us consider the probability of falling in this rival smaller subset $S'$.
The probability of the subset $S'$ is

$$P(\mathbf{x} \in S') = P(\mathbf{x} \in S' \cap T_{N\beta}) + P(\mathbf{x} \in S' \cap \overline{T_{N\beta}}), \tag{4.38}$$

where $\overline{T_{N\beta}}$ denotes the complement $\{\mathbf{x} \notin T_{N\beta}\}$. The maximum value of
the first term is found if $S' \cap T_{N\beta}$ contains $2^{N(H-2\beta)}$ outcomes all with the
maximum probability, $2^{-N(H-\beta)}$. The maximum value the second term can
have is $P(\mathbf{x} \notin T_{N\beta})$. So:

$$P(\mathbf{x} \in S') \le 2^{N(H-2\beta)} \, 2^{-N(H-\beta)} + \frac{\sigma^2}{\beta^2 N} = 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N}. \tag{4.39}$$

We can now set $\beta = \epsilon/2$ and $N_0$ such that $P(\mathbf{x} \in S') < 1 - \delta$, which shows
that $S'$ cannot satisfy the definition of a sufficient subset $S_\delta$. Thus any subset
$S'$ with size $|S'| \le 2^{N(H-\epsilon)}$ has probability less than $1 - \delta$, so by the definition
of $H_\delta$, $H_\delta(X^N) > N(H - \epsilon)$.

Thus for large enough $N$, the function $\frac{1}{N} H_\delta(X^N)$ is essentially a constant
function of $\delta$, for $0 < \delta < 1$, as illustrated in figures 4.9 and 4.13. □
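The proof of Part 2 leaves $N_0$ unspecified. One explicit choice that suffices, offered here as a sketch rather than as the choice the author had in mind, is to make each of the two terms on the right of (4.39) smaller than $\frac{1}{2}(1 - \delta)$:

```latex
\begin{align*}
2^{-N\beta} < \tfrac{1}{2}(1-\delta)
  &\iff N > \frac{1}{\beta}\log_2\frac{2}{1-\delta}, \\
\frac{\sigma^2}{\beta^2 N} < \tfrac{1}{2}(1-\delta)
  &\iff N > \frac{2\sigma^2}{\beta^2(1-\delta)}, \\
\text{so with } \beta = \epsilon/2: \quad
N_0 &= \max\!\left( \frac{2}{\epsilon}\log_2\frac{2}{1-\delta},\;
                    \frac{8\sigma^2}{\epsilon^2(1-\delta)} \right).
\end{align*}
```

For any $N > N_0$ both terms are below $\frac{1}{2}(1-\delta)$, so their sum is below $1 - \delta$, as the argument requires.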





> 4.6 Comments 


The source coding theorem (p.78) has two parts, $\frac{1}{N} H_\delta(X^N) < H + \epsilon$, and
$\frac{1}{N} H_\delta(X^N) > H - \epsilon$. Both results are interesting.

The first part tells us that even if the probability of error $\delta$ is extremely
small, the number of bits per symbol $\frac{1}{N} H_\delta(X^N)$ needed to specify a long
$N$-symbol string $\mathbf{x}$ with vanishingly small error probability does not have to
exceed $H + \epsilon$ bits. We need to have only a tiny tolerance for error, and the
number of bits required drops significantly from $H_0(X)$ to $(H + \epsilon)$.

What happens if we are yet more tolerant to compression errors? Part 2
tells us that even if $\delta$ is very close to 1, so that errors are made most of the
time, the average number of bits per symbol needed to specify $\mathbf{x}$ must still be
at least $H - \epsilon$ bits. These two extremes tell us that regardless of our specific
allowance for error, the number of bits per symbol needed to specify $\mathbf{x}$ is $H$
bits; no more and no less.


Caveat regarding ‘asymptotic equipartition’ 


I put the words ‘asymptotic equipartition’ in quotes because it is important
not to think that the elements of the typical set $T_{N\beta}$ really do have roughly
the same probability as each other. They are similar in probability only in
the sense that their values of $\log_2 \frac{1}{P(\mathbf{x})}$ are within $2N\beta$ of each other. Now, as
$\beta$ is decreased, how does $N$ have to increase, if we are to keep our bound on
the mass of the typical set, $P(\mathbf{x} \in T_{N\beta}) \ge 1 - \frac{\sigma^2}{\beta^2 N}$, constant? $N$ must grow
as $1/\beta^2$, so, if we write $\beta$ in terms of $N$ as $\alpha/\sqrt{N}$, for some constant $\alpha$, then
the most probable string in the typical set will be of order $2^{\alpha\sqrt{N}}$ times greater
than the least probable string in the typical set. As $\beta$ decreases, $N$ increases,
and this ratio $2^{\alpha\sqrt{N}}$ grows exponentially. Thus we have ‘equipartition’ only in
a weak sense!
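To see how unequal the probabilities inside a typical set can be, one can compute the ratio between its most and least probable members for the bent coin of example 4.7. The sketch below is illustrative only: the function name is hypothetical, it assumes $p_1 < 1/2$ and a non-empty typical set, and it uses the two $(N, \beta)$ pairs from the caption of figure 4.11, which correspond to roughly the same $\alpha = \beta\sqrt{N}$.

```python
from math import log2


def typical_probability_ratio_log2(N, p1, beta):
    """log2 of (most probable / least probable) string in T_{N beta},
    for N i.i.d. bits with P(1) = p1 < 1/2."""
    p0 = 1 - p1
    H = p1 * log2(1 / p1) + p0 * log2(1 / p0)
    typical_r = [
        r for r in range(N + 1)
        if abs((r * log2(1 / p1) + (N - r) * log2(1 / p0)) / N - H) < beta
    ]
    # Each additional 1 multiplies a string's probability by p1 / p0.
    return (max(typical_r) - min(typical_r)) * log2(p0 / p1)


for N, beta in ((100, 0.29), (1000, 0.09)):
    ratio_bits = typical_probability_ratio_log2(N, 0.1, beta)
    print(f"N={N}, beta={beta}: ratio = 2^{ratio_bits:.1f}, bound 2^{2 * N * beta:.0f}")
```

The exponent stays below the $2N\beta$ allowed by the definition of the typical set, yet it grows with $N$ even though $\beta$ has been reduced.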


Why did we introduce the typical set? 


The best choice of subset for block compression is (by definition) $S_\delta$, not a
typical set. So why did we bother introducing the typical set? The answer is,
we can count the typical set. We know that all its elements have ‘almost
identical’ probability ($2^{-NH}$), and we know the whole set has probability almost
1, so the typical set must have roughly $2^{NH}$ elements. Without the help of
the typical set (which is very similar to $S_\delta$) it would have been hard to count
how many elements there are in $S_\delta$.


> 4.7 Exercises 


Weighing problems 


> Exercise 4.9.4] While some people, when they first encounter the weighing 
problem with 12 balls and the three-outcome balance (exercise 4.1 
(p.66)), think that weighing six balls against six balls is a good first 
weighing, others say ‘no, weighing six against six conveys no informa- 
tion at all’. Explain to the second group why they are both right and 
wrong. Compute the information gained about which is the odd ball, 
and the information gained about which is the odd ball and whether it is 
heavy or light. 

> Exercise 4.10.!] Solve the weighing problem for the case where there are 39 
balls of which one is known to be odd. 


> Exercise 4.11.[2] You are given 16 balls, all of which are equal in weight except
for one that is either heavier or lighter. You are also given a bizarre two- 
pan balance that can report only two outcomes: ‘the two sides balance’ 
or ‘the two sides do not balance’. Design a strategy to determine which 
is the odd ball in as few uses of the balance as possible. 


> Exercise 4.12.[2] You have a two-pan balance; your job is to weigh out bags of
flour with integer weights 1 to 40 pounds inclusive. How many weights 
do you need? [You are allowed to put weights on either pan. You’re only 
allowed to put one flour bag on the balance at a time.] 


Exercise 4.13.[4, p.86] (a) Is it possible to solve exercise 4.1 (p.66) (the
weighing problem with 12 balls and the three-outcome balance) using a
sequence of three fixed weighings, such that the balls chosen for the
second weighing do not depend on the outcome of the first, and the
third weighing does not depend on the first or second?


(b) Find a solution to the general $N$-ball weighing problem in which
exactly one of $N$ balls is odd. Show that in $W$ weighings, an odd
ball can be identified from among $N = (3^W - 3)/2$ balls.


Exercise 4.14.1] You are given 12 balls and the three-outcome balance of exer- 
cise 4.1; this time, two of the balls are odd; each odd ball may be heavy 
or light, and we don’t know which. We want to identify the odd balls 
and in which direction they are odd. 




(a) Estimate how many weighings are required by the optimal strategy. 
And what if there are three odd balls? 


(b) How do your answers change if it is known that all the regular balls
weigh 100 g, that light balls weigh 99 g, and heavy ones weigh 110 g?


Source coding with a lossy compressor, with loss $\delta$


Exercise 4.15.[2, p.87] Let $\mathcal{P}_X = \{0.2, 0.8\}$. Sketch $\frac{1}{N} H_\delta(X^N)$ as a function of
$\delta$ for $N = 1, 2$ and 1000.


Exercise 4.16.[2] Let $\mathcal{P}_Y = \{0.5, 0.5\}$. Sketch $\frac{1}{N} H_\delta(Y^N)$ as a function of $\delta$ for
$N = 1, 2, 3$ and 100.








Exercise 4.17.1 P-87] (For physics students.) Discuss the relationship between 
the proof of the ‘asymptotic equipartition’ principle and the equivalence 
(for large systems) of the Boltzmann entropy and the Gibbs entropy. 


Distributions that don’t obey the law of large numbers 


The law of large numbers, which we used in this chapter, shows that the mean 
of a set of N i.i.d. random variables has a probability distribution that becomes 
narrower, with width $\propto 1/\sqrt{N}$, as $N$ increases. However, we have proved
this property only for discrete random variables, that is, for real numbers 
taking on a finite set of possible values. While many random variables with 
continuous probability distributions also satisfy the law of large numbers, there 
are important distributions that do not. Some continuous distributions do not 
have a mean or variance. 


> Exercise 4.18.[2, p.88] Sketch the Cauchy distribution

$$P(x) = \frac{1}{Z} \frac{1}{x^2 + 1}, \qquad x \in (-\infty, \infty). \tag{4.40}$$

What is its normalizing constant $Z$? Can you evaluate its mean or
variance?


Consider the sum $z = x_1 + x_2$, where $x_1$ and $x_2$ are independent random
variables from a Cauchy distribution. What is $P(z)$? What is the probability
distribution of the mean of $x_1$ and $x_2$, $\bar{x} = (x_1 + x_2)/2$? What is
the probability distribution of the mean of $N$ samples from this Cauchy
distribution?


Other asymptotic properties 


Exercise 4.19.1] Chernoff bound. We derived the weak law of large numbers 
from Chebyshev’s inequality (4.30) by letting the random variable $t$ in
the inequality $P(t \ge \alpha) \le \bar{t}/\alpha$ be a function, $t = (x - \bar{x})^2$, of the random
variable $x$ we were interested in.

Other useful inequalities can be obtained by using other functions. The
Chernoff bound, which is useful for bounding the tails of a distribution,
is obtained by letting $t = \exp(sx)$.

Show that

$$P(x \ge a) \le e^{-sa} g(s), \quad \text{for any } s > 0 \tag{4.41}$$

and

$$P(x \le a) \le e^{-sa} g(s), \quad \text{for any } s < 0 \tag{4.42}$$

where $g(s)$ is the moment-generating function of $x$,

$$g(s) = \sum_x P(x) e^{sx}. \tag{4.43}$$


Curious functions related to $p \log 1/p$


Exercise 4.20.[4, p.89] This exercise has no purpose at all; it’s included for the
enjoyment of those who like mathematical curiosities.


Sketch the function

$$f(x) = x^{x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}} \tag{4.44}$$

for $x \ge 0$. Hint: Work out the inverse function to $f$, that is, the function
$g(y)$ such that if $x = g(y)$ then $y = f(x)$; it’s closely related to $p \log 1/p$.


> 4.8 Solutions 


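Solution to exercise 4.2 (p.68). Let $P(x,y) = P(x)P(y)$. Then

\begin{align}
H(X,Y) &= \sum_{xy} P(x)P(y) \log \frac{1}{P(x)P(y)} \tag{4.45} \\
&= \sum_{xy} P(x)P(y) \log \frac{1}{P(x)} + \sum_{xy} P(x)P(y) \log \frac{1}{P(y)} \tag{4.46} \\
&= \sum_{x} P(x) \log \frac{1}{P(x)} + \sum_{y} P(y) \log \frac{1}{P(y)} \tag{4.47} \\
&= H(X) + H(Y). \tag{4.48}
\end{align}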

Solution to exercise 4.4 (p.73). An ASCII file can be reduced in size by a 
factor of 7/8. This reduction could be achieved by a block code that maps 
8-byte blocks into 7-byte blocks by copying the 56 information-carrying bits 
into 7 bytes, and ignoring the last bit of every character. 


Solution to exercise 4.5 (p.74). The pigeon-hole principle states: you can’t 
put 16 pigeons into 15 holes without using one of the holes twice. 

Similarly, you can’t give the $|\mathcal{A}_X|$ outcomes unique binary names of some length
$l$ shorter than $\log_2 |\mathcal{A}_X|$ bits, because there are only $2^l$ such binary names,
and $l < \log_2 |\mathcal{A}_X|$ implies $2^l < |\mathcal{A}_X|$, so at least two different inputs to the
compressor would compress to the same output file.


Solution to exercise 4.8 (p.76). Between the cusps, all the changes in probability
are equal, and the number of elements in $T$ changes by one at each step.
So $H_\delta$ varies logarithmically with $(-\delta)$.


Solution to exercise 4.13 (p.84). This solution was found by Dyson and Lyness 
in 1946 and presented in the following elegant form by John Conway in 1999. 
Be warned: the symbols A, B, and C are used to name the balls, to name the 
pans of the balance, to name the outcomes, and to name the possible states 
of the odd ball! 


(a) Label the 12 balls by the sequences

    AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA

and in the

    1st                  AAB ABA ABB ABC              BBC BCA BCB BCC
    2nd  weighings put   AAB CAA CAB CAC   in pan A,  ABA ABB ABC BBC   in pan B.
    3rd                  ABA BCA CAA CCA              AAB ABB BCB CAB


Now in a given weighing, a pan will either end up in the 


e Canonical position (C) that it assumes when the pans are balanced, 
or 


e Above that position (A), or 
e Below it (B), 


so the three weighings determine for each pan a sequence of three of 
these letters. 


If both sequences are CCC, then there’s no odd ball. Otherwise, for just 
one of the two pans, the sequence is among the 12 above, and names 
the odd ball, whose weight is Above or Below the proper one according 
as the pan is A or B. 


(b) In $W$ weighings the odd ball can be identified from among

$$N = (3^W - 3)/2 \tag{4.49}$$

balls in the same way, by labelling them with all the non-constant
sequences of $W$ letters from A, B, C whose first change is A-to-B or B-to-C
or C-to-A, and at the $w$th weighing putting those whose $w$th letter is A
in pan A and those whose $w$th letter is B in pan B.
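The fixed three-weighing scheme of part (a) is small enough to check exhaustively. The sketch below simulates all 24 cases (12 balls, each heavy or light) under the assumption that the pan holding a heavy ball ends Below (B) and the pan holding a light ball ends Above (A); the function names are hypothetical.

```python
LABELS = "AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA".split()


def pan_sequences(odd_ball, heavy):
    """Positions (A = above, B = below, C = canonical) taken by pan A and
    pan B over the three weighings when `odd_ball` is the odd one."""
    seq_a, seq_b = "", ""
    for letter in odd_ball:
        if letter == "C":  # odd ball is off the balance: pans stay level
            seq_a, seq_b = seq_a + "C", seq_b + "C"
            continue
        holder_goes = "B" if heavy else "A"   # pan holding the odd ball
        other_goes = "A" if heavy else "B"    # the opposite pan
        if letter == "A":  # odd ball sits in pan A at this weighing
            seq_a, seq_b = seq_a + holder_goes, seq_b + other_goes
        else:              # odd ball sits in pan B at this weighing
            seq_a, seq_b = seq_a + other_goes, seq_b + holder_goes
    return seq_a, seq_b


def identify(seq_a, seq_b):
    """Return (ball, pan) for the single pan whose sequence is a ball label."""
    hits = [(seq, pan) for seq, pan in ((seq_a, "A"), (seq_b, "B")) if seq in LABELS]
    assert len(hits) == 1, "exactly one pan should name a ball"
    return hits[0]


for ball in LABELS:
    for heavy in (True, False):
        named, pan = identify(*pan_sequences(ball, heavy))
        assert named == ball
print("all 24 cases identified")
```

In every case exactly one pan's sequence is a valid label, it names the odd ball, and which pan it is separates the heavy cases from the light ones.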


Solution to exercise 4.15 (p.85). The curves $\frac{1}{N} H_\delta(X^N)$ as a function of $\delta$ for
$N = 1, 2$ and 1000 are shown in figure 4.14. Note that $H_2(0.2) = 0.72$ bits.

    N = 1
    δ           (1/N) H_δ(X^N)    2^{H_δ(X^N)}
    0-0.2       1                 2
    0.2-1       0                 1

    N = 2
    δ           (1/N) H_δ(X^N)    2^{H_δ(X^N)}
    0-0.04      1                 4
    0.04-0.2    0.79              3
    0.2-0.36    0.5               2
    0.36-1      0                 1

Figure 4.14. $\frac{1}{N} H_\delta(X)$ (vertical axis) against $\delta$ (horizontal), for
$N = 1, 2, 100$ binary variables with $p_1 = 0.4$.


Solution to exercise 4.17 (p.85). The Gibbs entropy is $k_B \sum_i p_i \ln \frac{1}{p_i}$, where $i$
runs over all states of the system. This entropy is equivalent (apart from the
factor of $k_B$) to the Shannon entropy of the ensemble.

Whereas the Gibbs entropy can be defined for any ensemble, the Boltzmann
entropy is only defined for microcanonical ensembles, which have a
probability distribution that is uniform over a set of accessible states. The
Boltzmann entropy is defined to be $S_B = k_B \ln \Omega$ where $\Omega$ is the number of
accessible states of the microcanonical ensemble. This is equivalent (apart from
the factor of $k_B$) to the perfect information content $H_0$ of that constrained
ensemble. The Gibbs entropy of a microcanonical ensemble is trivially equal
to the Boltzmann entropy.

We now consider a thermal distribution (the canonical ensemble), where
the probability of a state $\mathbf{x}$ is

$$P(\mathbf{x}) = \frac{1}{Z} \exp\left( -\frac{E(\mathbf{x})}{k_B T} \right). \tag{4.50}$$





With this canonical ensemble we can associate a corresponding microcanonical
ensemble, an ensemble with total energy fixed to the mean energy of the
canonical ensemble (fixed to within some precision $\epsilon$). Now, fixing the total
energy to a precision $\epsilon$ is equivalent to fixing the value of $\ln 1/P(\mathbf{x})$ to within
$\epsilon / (k_B T)$. Our definition of the typical set $T_{N\beta}$ was precisely that it consisted
of all elements that have a value of $\log P(\mathbf{x})$ very close to the mean value of
$\log P(\mathbf{x})$ under the canonical ensemble, $-NH(X)$. Thus the microcanonical
ensemble is equivalent to a uniform distribution over the typical set of the
canonical ensemble.

Our proof of the ‘asymptotic equipartition’ principle thus proves — for the 
case of a system whose energy is separable into a sum of independent terms 
— that the Boltzmann entropy of the microcanonical ensemble is very close 
(for large N) to the Gibbs entropy of the canonical ensemble, if the energy of 
the microcanonical ensemble is constrained to equal the mean energy of the 
canonical ensemble. 


Solution to exercise 4.18 (p.85). The normalizing constant of the Cauchy
distribution

$$P(x) = \frac{1}{Z} \frac{1}{x^2 + 1}$$

is

$$Z = \int_{-\infty}^{\infty} dx \, \frac{1}{x^2 + 1} = \left[ \tan^{-1} x \right]_{-\infty}^{\infty} = \frac{\pi}{2} - \frac{-\pi}{2} = \pi. \tag{4.51}$$

The mean and variance of this distribution are both undefined. (The distribution
is symmetrical about zero, but this does not imply that its mean is zero.
The mean is the value of a divergent integral.) The sum $z = x_1 + x_2$, where
$x_1$ and $x_2$ both have Cauchy distributions, has probability density given by
the convolution

$$P(z) = \frac{1}{\pi^2} \int_{-\infty}^{\infty} dx_1 \, \frac{1}{x_1^2 + 1} \, \frac{1}{(z - x_1)^2 + 1}, \tag{4.52}$$

which after a considerable labour using standard methods gives

$$P(z) = \frac{1}{\pi^2} \, 2 \, \frac{\pi}{z^2 + 4} = \frac{2}{\pi} \frac{1}{z^2 + 2^2}, \tag{4.53}$$

which we recognize as a Cauchy distribution with width parameter 2 (where
the original distribution has width parameter 1). This implies that the mean
of the two points, $\bar{x} = (x_1 + x_2)/2 = z/2$, has a Cauchy distribution with
width parameter 1. Generalizing, the mean of $N$ samples from a Cauchy
distribution is Cauchy-distributed with the same parameters as the individual
samples. The probability distribution of the mean does not become narrower
as $1/\sqrt{N}$.

The central-limit theorem does not apply to the Cauchy distribution, be- 
cause it does not have a finite variance. 

An alternative neat method for getting to equation (4.53) makes use of the
Fourier transform of the Cauchy distribution, which is a biexponential $e^{-|\omega|}$.
Convolution in real space corresponds to multiplication in Fourier space, so
the Fourier transform of z is simply $e^{-|2\omega|}$. Reversing the transform, we obtain
equation (4.53).
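
Equation (4.53) can also be checked numerically. The following Python sketch is an illustration only, not part of the original solution: the function names `cauchy` and `convolve_at`, the integration range and the step count are all assumptions. It estimates the convolution integral (4.52) with the midpoint rule and compares it with a Cauchy density of width parameter 2.

```python
import math

def cauchy(x, width=1.0):
    """Cauchy density with the given width parameter, centred on zero."""
    return (1.0 / math.pi) * width / (x * x + width * width)

def convolve_at(z, half_range=200.0, steps=40_000):
    """Midpoint-rule estimate of the convolution integral (4.52) at the point z."""
    h = 2.0 * half_range / steps
    total = 0.0
    for k in range(steps):
        x1 = -half_range + (k + 0.5) * h
        total += cauchy(x1) * cauchy(z - x1)
    return total * h

# Compare the numerical convolution with the width-2 Cauchy density.
for z in (0.0, 1.5, 4.0):
    print(z, convolve_at(z), cauchy(z, width=2.0))
```

The two printed columns should agree to several decimal places; the truncation of the integral to a finite range is the main approximation here.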


Solution to exercise 4.20 (p.86). The function f(x) has inverse function

$$ g(y) = y^{1/y}. \tag{4.54} $$

Note

$$ \log g(y) = 1/y \, \log y. \tag{4.55} $$

I obtained a tentative graph of f(x) by plotting g(y) with y along the vertical
axis and g(y) along the horizontal axis. The resulting graph suggests that
f(x) is single valued for $x \in (0,1)$, and looks surprisingly well-behaved and
ordinary; for $x \in (1, e^{1/e})$, f(x) is two-valued. $f(\sqrt{2})$ is equal both to 2 and
4. For $x > e^{1/e}$ (which is about 1.44), f(x) is infinite. However, it might be
argued that this approach to sketching f(x) is only partly valid, if we define f
as the limit of the sequence of functions $x, x^x, x^{x^x}, \ldots$; this sequence does not
have a limit for $0 < x < (1/e)^e \simeq 0.07$ on account of a pitchfork bifurcation
at $x = (1/e)^e$; and for $x \in (1, e^{1/e})$, the sequence's limit is single-valued: the
lower of the two values sketched in the figure.

Figure 4.15. $f(x) = x^{x^{x^{\cdot^{\cdot^{\cdot}}}}}$ shown
at three different scales.
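
The behaviour of the sequence $x, x^x, x^{x^x}, \ldots$ can be explored by direct iteration. This is a minimal sketch under assumptions of my own (the helper name `tower_limit` and the iteration count are hypothetical); it simply repeats $y \leftarrow x^y$ and reports where the iterates end up.

```python
import math

def tower_limit(x, iterations=10_000):
    """Iterate y <- x**y starting from y = x and return the final iterate."""
    y = x
    for _ in range(iterations):
        y = x ** y
    return y

# For x = sqrt(2) the iterates settle on the lower of the two values, 2.
print(tower_limit(math.sqrt(2)))

# For x below (1/e)^e the even and odd iterates settle on different values,
# so the sequence has no limit.
print(tower_limit(0.05, 10_000), tower_limit(0.05, 10_001))
```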
About Chapter 5


In the last chapter, we saw a proof of the fundamental status of the entropy 
as a measure of average information content. We defined a data compression 
scheme using fixed length block codes, and proved that as N increases, it is 
possible to encode N i.i.d. variables $\mathbf{x} = (x_1, \ldots, x_N)$ into a block of
$N(H(X) + \epsilon)$ bits with vanishing probability of error, whereas if we attempt
to encode $X^N$ into $N(H(X) - \epsilon)$ bits, the probability of error is virtually 1.

We thus verified the possibility of data compression, but the block coding 
defined in the proof did not give a practical algorithm. In this chapter and 
the next, we study practical data compression algorithms. Whereas the last 
chapter’s compression scheme used large blocks of fixed size and was lossy, 
in the next chapter we discuss variable-length compression schemes that are 
practical for small block sizes and that are not lossy. 

Imagine a rubber glove filled with water. If we compress two fingers of the 
glove, some other part of the glove has to expand, because the total volume 
of water is constant. (Water is essentially incompressible.) Similarly, when 
we shorten the codewords for some outcomes, there must be other codewords 
that get longer, if the scheme is not lossy. In this chapter we will discover the 
information-theoretic equivalent of water volume. 


Before reading Chapter 5, you should have worked on exercise 2.26 (p.37). 
We will use the following notation for intervals: 


$x \in [1,2)$ means that $x \ge 1$ and $x < 2$;
$x \in (1,2]$ means that $x > 1$ and $x \le 2$.
Symbol Codes


In this chapter, we discuss variable-length symbol codes, which encode one 
source symbol at a time, instead of encoding huge strings of N source sym- 
bols. These codes are lossless: unlike the last chapter’s block codes, they are 
guaranteed to compress and decompress without any errors; but there is a 
chance that the codes may sometimes produce encoded strings longer than 
the original source string. 

The idea is that we can achieve compression, on average, by assigning 
shorter encodings to the more probable outcomes and longer encodings to the 
less probable. 

The key issues are: 


What are the implications if a symbol code is lossless? If some code- 
words are shortened, by how much do other codewords have to be length- 
ened? 


Making compression practical. How can we ensure that a symbol code is 
easy to decode? 


Optimal symbol codes. How should we assign codelengths to achieve the 
best compression, and what is the best achievable compression? 


We again verify the fundamental status of the Shannon information content 
and the entropy, proving: 


Source coding theorem (symbol codes). There exists a variable-length
encoding C of an ensemble X such that the average length of an encoded
symbol, $L(C,X)$, satisfies $L(C,X) \in [H(X), H(X)+1)$.


The average length is equal to the entropy H(X) only if the codelength 
for each outcome is equal to its Shannon information content. 


We will also define a constructive procedure, the Huffman coding algorithm, 
that produces optimal symbol codes. 


Notation for alphabets. $\mathcal{A}^N$ denotes the set of ordered N-tuples of
elements from the set $\mathcal{A}$, i.e., all strings of length N. The symbol $\mathcal{A}^+$ will
denote the set of all strings of finite length composed of elements from
the set $\mathcal{A}$.

Example 5.1. $\{0,1\}^3 = \{000, 001, 010, 011, 100, 101, 110, 111\}$.

Example 5.2. $\{0,1\}^+ = \{0, 1, 00, 01, 10, 11, 000, 001, \ldots\}$.
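
The two sets in examples 5.1 and 5.2 can be enumerated mechanically. The sketch below uses Python's `itertools.product`; the helper names `strings_of_length` and `strings_up_to` are hypothetical, and since $\mathcal{A}^+$ is infinite the second helper lists only strings up to a chosen maximum length.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """All strings of length n over the alphabet, i.e. the set A^N with N = n."""
    return ["".join(t) for t in product(alphabet, repeat=n)]

def strings_up_to(alphabet, max_len):
    """The beginning of A^+: all strings of length 1, 2, ..., max_len."""
    out = []
    for n in range(1, max_len + 1):
        out.extend(strings_of_length(alphabet, n))
    return out

print(strings_of_length("01", 3))
print(strings_up_to("01", 2))
```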


> 5.1 Symbol codes 


A (binary) symbol code C for an ensemble X is a mapping from the range
of x, $\mathcal{A}_X = \{a_1, \ldots, a_I\}$, to $\{0,1\}^+$. $c(x)$ will denote the codeword
corresponding to x, and $l(x)$ will denote its length, with $l_i = l(a_i)$.


The extended code $C^+$ is a mapping from $\mathcal{A}_X^+$ to $\{0,1\}^+$ obtained by
concatenation, without punctuation, of the corresponding codewords:

$$ c^+(x_1 x_2 \ldots x_N) = c(x_1) c(x_2) \ldots c(x_N). \tag{5.1} $$


[The term ‘mapping’ here is a synonym for ‘function’.] 


Example 5.3. A symbol code for the ensemble X defined by

$$ \mathcal{A}_X = \{ a, b, c, d \}, \quad \mathcal{P}_X = \{ 1/2, 1/4, 1/8, 1/8 \}, \tag{5.2} $$

is C0, shown in the margin.

    C0:
    a_i   c(a_i)   l_i
    a     1000     4
    b     0100     4
    c     0010     4
    d     0001     4

Using the extended code, we may encode acdbac as

$$ c^+(\mathtt{acdbac}) = \mathtt{100000100001010010000010}. \tag{5.3} $$
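
The extended code is just concatenation, which a few lines of Python make concrete. This is a minimal sketch: the dictionary representation of C0 and the function name `encode` are my own choices, not notation from the text.

```python
# Code C0 from example 5.3: every codeword has length 4.
C0 = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}

def encode(code, symbols):
    """Extended code c+: concatenate the codewords with no punctuation."""
    return "".join(code[s] for s in symbols)

encoded = encode(C0, "acdbac")
print(encoded)
```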


There are basic requirements for a useful symbol code. First, any encoded 
string must have a unique decoding. Second, the symbol code must be easy to 
decode. And third, the code should achieve as much compression as possible. 


Any encoded string must have a unique decoding 


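A code C(X) is uniquely decodeable if, under the extended code $C^+$, no
two distinct strings have the same encoding, i.e.,

$$ \forall\, \mathbf{x}, \mathbf{y} \in \mathcal{A}_X^+, \quad \mathbf{x} \ne \mathbf{y} \;\Rightarrow\; c^+(\mathbf{x}) \ne c^+(\mathbf{y}). \tag{5.4} $$

The code C0 defined above is an example of a uniquely decodeable code.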


The symbol code must be easy to decode 


A symbol code is easiest to decode if it is possible to identify the end of a 
codeword as soon as it arrives, which means that no codeword can be a prefix 
of another codeword. [A word c is a prefix of another word d if there exists a 
tail string t such that the concatenation ct is identical to d. For example, 1 is 
a prefix of 101, and so is 10.] 

We will show later that we don’t lose any performance if we constrain our 
symbol code to be a prefix code. 


A symbol code is called a prefix code if no codeword is a prefix of any 
other codeword. 


A prefix code is also known as an instantaneous or self-punctuating code, 
because an encoded string can be decoded from left to right without 
looking ahead to subsequent codewords. The end of a codeword is im- 
mediately recognizable. A prefix code is uniquely decodeable. 


Prefix codes are also known as ‘prefix-free codes’ or ‘prefix condition codes’. 
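
To see why a prefix code is called instantaneous, here is a small decoder sketch in Python. The function name `decode_prefix` and the three-symbol code `toy` are hypothetical illustrations; C0 is the code of example 5.3. The decoder reads bits from left to right and emits a symbol the moment the bits read so far form a codeword, which is safe only because no codeword is a prefix of another.

```python
def decode_prefix(code, bits):
    """Decode left to right, emitting a symbol as soon as a codeword is complete."""
    inverse = {word: symbol for symbol, word in code.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # end of codeword recognized immediately
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a codeword")
    return "".join(symbols)

C0 = {"a": "1000", "b": "0100", "c": "0010", "d": "0001"}
toy = {"x": "0", "y": "10", "z": "11"}   # hypothetical variable-length prefix code

print(decode_prefix(C0, "100000100001010010000010"))
print(decode_prefix(toy, "010110"))
```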


Prefix codes correspond to trees, as illustrated in the margin of the next page. 


Example 5.4. The code C1 = {0,101} is a prefix code because 0 is not a prefix 
of 101, nor is 101 a prefix of 0. 


Example 5.5. Let C2 = {1,101}. This code is not a prefix code because 1 is a 
prefix of 101. 


Example 5.6. The code C3 = {0, 10,110,111} is a prefix code. 
Example 5.7. The code C4 = {00, 01,10, 11} is a prefix code. 
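
The prefix condition in examples 5.4 to 5.7 is easy to test by machine. The following is a sketch, with `is_prefix_code` a hypothetical helper name; it compares every ordered pair of distinct codewords.

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    for i, first in enumerate(words):
        for j, second in enumerate(words):
            if i != j and second.startswith(first):
                return False
    return True

C1 = ["0", "101"]
C2 = ["1", "101"]
C3 = ["0", "10", "110", "111"]
C4 = ["00", "01", "10", "11"]

for name, code in [("C1", C1), ("C2", C2), ("C3", C3), ("C4", C4)]:
    print(name, is_prefix_code(code))
```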


Exercise 5.8 (p.104). Is C2 uniquely decodeable?

Example 5.9. Consider exercise 4.1 (p.66) and figure 4.2 (p.69). Any weighing 
strategy that identifies the odd ball and whether it is heavy or light can 
be viewed as assigning a ternary code to each of the 24 possible states. 
This code is a prefix code. 


The code should achieve as much compression as possible 


The expected length L(C, X) of a symbol code C for ensemble X is

$$ L(C,X) = \sum_{x \in \mathcal{A}_X} P(x)\, l(x). \tag{5.5} $$

We may also write this quantity as

$$ L(C,X) = \sum_{i=1}^{I} p_i l_i, \tag{5.6} $$

where $I = |\mathcal{A}_X|$.

Example 5.10. Let

$$ \mathcal{A}_X = \{ a, b, c, d \}, \quad \text{and} \quad \mathcal{P}_X = \{ 1/2, 1/4, 1/8, 1/8 \}, \tag{5.7} $$


and consider the code C3. The entropy of X is 1.75 bits, and the expected
length $L(C_3, X)$ of this code is also 1.75 bits. The sequence of symbols
x = (acdbac) is encoded as $c^+(\mathbf{x}) = \mathtt{0110111100110}$. C3 is a prefix code
and is therefore uniquely decodeable. Notice that the codeword lengths
satisfy $l_i = \log_2(1/p_i)$, or equivalently, $p_i = 2^{-l_i}$.


Example 5.11. Consider the fixed length code for the same ensemble X, C4. 
The expected length L(C4, X) is 2 bits. 


Example 5.12. Consider C5. The expected length L(C5, X) is 1.25 bits, which 
is less than H(X). But the code is not uniquely decodeable. The se- 
quence x =(acdbac) encodes as 000111000, which can also be decoded 
as (cabdca). 
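
The numbers quoted in examples 5.10 to 5.12 can be reproduced directly from equation (5.6). The sketch below is illustrative only: the function names are hypothetical, and the assignment of codewords to the symbols a, b, c, d follows the margin tables for C3, C4 and C5.

```python
import math
from fractions import Fraction

P = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
C3 = {"a": "0", "b": "10", "c": "110", "d": "111"}
C4 = {"a": "00", "b": "01", "c": "10", "d": "11"}
C5 = {"a": "0", "b": "1", "c": "00", "d": "11"}

def expected_length(code, probs):
    """L(C, X) = sum_i p_i l_i, equation (5.6)."""
    return sum(p * len(code[s]) for s, p in probs.items())

def entropy(probs):
    """H(X) in bits."""
    return sum(float(p) * math.log2(1 / float(p)) for p in probs.values())

def encode(code, symbols):
    """Extended code: concatenate the codewords."""
    return "".join(code[s] for s in symbols)

print("H(X) =", entropy(P))
for name, code in [("C3", C3), ("C4", C4), ("C5", C5)]:
    print(name, float(expected_length(code, P)))

# C5 beats the entropy only because it is not uniquely decodeable:
print(encode(C5, "acdbac"), encode(C5, "cabdca"))
```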


Example 5.13. Consider the code C6. The expected length $L(C_6, X)$ of this
code is 1.75 bits. The sequence of symbols x = (acdbac) is encoded as
$c^+(\mathbf{x}) = \mathtt{0011111010011}$.


Is C6 a prefix code? It is not, because c(a) = 0 is a prefix of both c(b)
and c(c).


Prefix codes can be represented
on binary trees. Complete prefix
codes correspond to binary trees
with no unused branches. C1 is an
incomplete code.


    C3:
    a_i   c(a_i)   p_i   h(p_i)   l_i
    a     0        1/2   1.0      1
    b     10       1/4   2.0      2
    c     110      1/8   3.0      3
    d     111      1/8   3.0      3

    C4 and C5:
    a_i   C4    C5
    a     00    0
    b     01    1
    c     10    00
    d     11    11

    C6:
    a_i   c(a_i)   p_i   h(p_i)   l_i
    a     0        1/2   1.0      1
    b     01       1/4   2.0      2
    c     011      1/8   3.0      3
    d     111      1/8   3.0      3


Is C6 uniquely decodeable? This is not so obvious. If you think that it
might not be uniquely decodeable, try to prove it so by finding a pair of
strings x and y that have the same encoding. [The definition of unique
decodeability is given in equation (5.4).]


C6 certainly isn’t easy to decode. When we receive ‘00’, it is possible
that x could start ‘aa’, ‘ab’ or ‘ac’. Once we have received ‘001111’,
the second symbol is still ambiguous, as x could be ‘abd...’ or ‘acd...’.
But eventually a unique decoding crystallizes, once the next 0 appears
in the encoded stream.


C6 is in fact uniquely decodeable. Comparing with the prefix code C3,
we see that the codewords of C6 are the reverse of C3’s. That C3 is
uniquely decodeable proves that C6 is too, since any string from C6 is
identical to a string from C3 read backwards.
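
A brute-force search makes the contrast between C5 and C6 tangible. The sketch below (the function name `find_collision` and the search depth are assumptions) looks for two source strings with the same encoding. Finding a collision proves a code is not uniquely decodeable; finding none up to a finite length is only supporting evidence, and the reversal argument above is what actually settles the matter for C6.

```python
from itertools import product

C3 = {"a": "0", "b": "10", "c": "110", "d": "111"}
C5 = {"a": "0", "b": "1", "c": "00", "d": "11"}
C6 = {"a": "0", "b": "01", "c": "011", "d": "111"}

def find_collision(code, max_len):
    """Search source strings of up to max_len symbols for two with the same encoding."""
    seen = {}
    for n in range(1, max_len + 1):
        for symbols in product(sorted(code), repeat=n):
            bits = "".join(code[s] for s in symbols)
            if bits in seen:
                return seen[bits], "".join(symbols)
            seen[bits] = "".join(symbols)
    return None

print(find_collision(C5, 2))   # a collision exists
print(find_collision(C6, 5))   # none found among strings of up to 5 symbols

# The codewords of C6 are those of C3 written backwards.
print({w[::-1] for w in C6.values()} == set(C3.values()))
```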


> 5.2 What limit is imposed by unique decodeability? 


We now ask, given a list of positive integers $\{l_i\}$, does there exist a uniquely
decodeable code with those integers as its codeword lengths? At this stage, we 
ignore the probabilities of the different symbols; once we understand unique 
decodeability better, we’ll reintroduce the probabilities and discuss how to 
make an optimal uniquely decodeable symbol code. 

In the examples above, we have observed that if we take a code such as
{00, 01, 10, 11}, and shorten one of its codewords, for example 00 → 0, then
we can retain unique decodeability only if we lengthen other codewords. Thus
there seems to be a constrained budget that we can spend on codewords, with
shorter codewords being more expensive.

Let us explore the nature of this budget. If we build a code purely from
codewords of length $l$ equal to three, how many codewords can we have and
retain unique decodeability? The answer is $2^l = 8$. Once we have chosen all
eight of these codewords, is there any way we could add to the code another
codeword of some other length and retain unique decodeability? It would
seem not.

What if we make a code that includes a length-one codeword, ‘0’, with the 
other codewords being of length three? How many length-three codewords can 
we have? If we restrict attention to prefix codes, then we can have only four 
codewords of length three, namely {100,101,110,111}. What about other 
codes? Is there any other way of choosing codewords of length 3 that can give 
more codewords? Intuitively, we think this unlikely. A codeword of length 3
appears to have a cost that is $2^2$ times smaller than a codeword of length 1.

Let’s define a total budget of size 1, which we can spend on codewords. If
we set the cost of a codeword whose length is $l$ to $2^{-l}$, then we have a pricing
system that fits the examples discussed above. Codewords of length 3 cost
1/8 each; codewords of length 1 cost 1/2 each. We can spend our budget on
any codewords. If we go over our budget then the code will certainly not be
uniquely decodeable. If, on the other hand,

$$ \sum_i 2^{-l_i} \le 1, \tag{5.8} $$

then the code may be uniquely decodeable. This inequality is the Kraft
inequality.
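
The budget is simple to compute exactly. In this sketch the helper name `kraft_sum` is hypothetical and Python's `Fraction` type is used so that the comparison with 1 involves no rounding; the length lists are those of the codes discussed in this section and in section 5.1.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Total cost of a set of codeword lengths: sum over i of 2^(-l_i)."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

print(kraft_sum([3] * 8))            # eight codewords of length three
print(kraft_sum([1, 3, 3, 3, 3]))    # '0' plus four codewords of length three
print(kraft_sum([1, 2, 3, 3]))       # the lengths of C3
print(kraft_sum([1, 1, 2, 2]))       # the lengths of C5: over budget
```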


Kraft inequality. For any uniquely decodeable code C(X) over the binary
alphabet {0,1}, the codeword lengths must satisfy:

$$ \sum_{i=1}^{I} 2^{-l_i} \le 1, \tag{5.9} $$

where $I = |\mathcal{A}_X|$.


Completeness. If a uniquely decodeable code satisfies the Kraft inequality 
with equality then it is called a complete code. 


We want codes that are uniquely decodeable; prefix codes are uniquely de- 
codeable, and are easy to decode. So life would be simpler for us if we could 
restrict attention to prefix codes. Fortunately, for any source there is an op- 
timal symbol code that is also a prefix code. 


Kraft inequality and prefix codes. Given a set of codeword lengths that 
satisfy the Kraft inequality, there exists a uniquely decodeable prefix 
code with these codeword lengths. 


The Kraft inequality might be more accurately referred to as the Kraft- 
McMillan inequality: Kraft proved that if the inequality is satisfied, then a 
prefix code exists with the given lengths. McMillan (1956) proved the con- 
verse, that unique decodeability implies that the inequality holds. 


Proof of the Kraft inequality. Define $S = \sum_i 2^{-l_i}$. Consider the quantity

$$ S^N = \left[ \sum_i 2^{-l_i} \right]^N = \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \cdots \sum_{i_N=1}^{I} 2^{-(l_{i_1} + l_{i_2} + \cdots + l_{i_N})}. \tag{5.10} $$

The quantity in the exponent, $(l_{i_1} + l_{i_2} + \cdots + l_{i_N})$, is the length of the
encoding of the string $\mathbf{x} = a_{i_1} a_{i_2} \ldots a_{i_N}$. For every string x of length N,
there is one term in the above sum. Introduce an array $A_l$ that counts
how many strings x have encoded length $l$. Then, defining $l_{\min} = \min_i l_i$
and $l_{\max} = \max_i l_i$:

$$ S^N = \sum_{l = N l_{\min}}^{N l_{\max}} 2^{-l} A_l. \tag{5.11} $$

Now assume C is uniquely decodeable, so that for all $\mathbf{x} \ne \mathbf{y}$, $c^+(\mathbf{x}) \ne c^+(\mathbf{y})$.
Concentrate on the x that have encoded length $l$. There are a
total of $2^l$ distinct bit strings of length $l$, so it must be the case that
$A_l \le 2^l$. So

$$ S^N = \sum_{l = N l_{\min}}^{N l_{\max}} 2^{-l} A_l \le \sum_{l = N l_{\min}}^{N l_{\max}} 1 \le N l_{\max}. \tag{5.12} $$

Thus $S^N \le l_{\max} N$ for all N. Now if S were greater than 1, then as N
increases, $S^N$ would be an exponentially growing function, and for large
enough N, an exponential always exceeds a polynomial such as $l_{\max} N$.
But our result ($S^N \le l_{\max} N$) is true for any N. Therefore $S \le 1$. □
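
The two ingredients of the proof can be watched in action on small codes. This Python sketch is an illustration under my own naming (`length_counts`, `first_violation`): it tabulates the array $A_l$ for the lengths of C3 and confirms equation (5.11), then finds the smallest N at which the over-budget lengths of C5 break the bound $S^N \le N l_{\max}$.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def kraft_sum(lengths):
    return sum(Fraction(1, 2 ** l) for l in lengths)

def length_counts(lengths, N):
    """A_l: how many strings of N source symbols have total encoded length l."""
    return Counter(sum(combo) for combo in product(lengths, repeat=N))

def first_violation(lengths, max_N=50):
    """Smallest N with S^N > N * l_max, or None if there is none up to max_N."""
    S, l_max = kraft_sum(lengths), max(lengths)
    for N in range(1, max_N + 1):
        if S ** N > N * l_max:
            return N
    return None

counts = length_counts([1, 2, 3, 3], 3)      # lengths of C3, strings of N = 3 symbols
print(sorted(counts.items()))
print(sum(Fraction(A, 2 ** l) for l, A in counts.items()))   # equals S^N

print(first_violation([1, 1, 2, 2]))   # lengths of C5, S = 3/2
print(first_violation([1, 2, 3, 3]))   # lengths of C3, S = 1
```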





> Exercise 5.14 (p.104). Prove the result stated above, that for any set of
codeword lengths $\{l_i\}$ satisfying the Kraft inequality, there is a prefix code
having those lengths.


Figure 5.1. The symbol coding
budget. The ‘cost’ $2^{-l}$ of each
codeword (with length $l$) is
indicated by the size of the box it
is written in. The total budget
available when making a uniquely
decodeable code is 1.

You can think of this diagram as 
showing a codeword supermarket, 
with the codewords arranged in 
aisles by their length, and the cost 
of each codeword indicated by the 
size of its box on the shelf. If the 
cost of the codewords that you 
take exceeds the budget then your 
code will not be uniquely 
decodeable. 
Figure 5.2. Selections of
codewords made by codes
C0, C3, C4 and C6 from section
5.1.


A pictorial view of the Kraft inequality may help you solve this exercise. 
Imagine that we are choosing the codewords to make a symbol code. We can 
draw the set of all candidate codewords in a supermarket that displays the 
‘cost’ of the codeword by the area of a box (figure 5.1). The total budget 
available — the ‘1’ on the right-hand side of the Kraft inequality — is shown at 
one side. Some of the codes discussed in section 5.1 are illustrated in figure 
5.2. Notice that the codes that are prefix codes, C0, C3, and C4, have the
property that to the right of any selected codeword, there are no other selected 
codewords — because prefix codes correspond to trees. Notice that a complete 
prefix code corresponds to a complete tree having no unused branches. 
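
The supermarket picture suggests one way to build a prefix code from a list of lengths: shop in order of increasing length, always taking the leftmost codeword that is still affordable. The Python sketch below is one possible construction under that idea, not necessarily the argument intended for exercise 5.14; the function name `prefix_code_from_lengths` is hypothetical.

```python
from fractions import Fraction

def prefix_code_from_lengths(lengths):
    """Assign binary codewords with the given lengths so that none is a prefix of another."""
    if sum(Fraction(1, 2 ** l) for l in lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords = [None] * len(lengths)
    value, prev_len = 0, 0
    # Visit the symbols in order of increasing codeword length.
    for index in sorted(range(len(lengths)), key=lambda i: lengths[i]):
        l = lengths[index]
        value <<= (l - prev_len)              # extend the running counter to l bits
        codewords[index] = format(value, "b").zfill(l)
        value += 1                            # move to the next free codeword
        prev_len = l
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))
print(prefix_code_from_lengths([3, 1, 3, 3, 3]))
```

With lengths (1, 2, 3, 3) this reproduces the codewords of C3.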


We are now ready to put back the symbols’ probabilities $\{p_i\}$. Given a
set of symbol probabilities (the English language probabilities of figure 2.1, 
for example), how do we make the best symbol code — one with the smallest 
possible expected length L(C, X)? And what is that smallest possible expected 
length? It’s not obvious how to assign the codeword lengths. If we give short 
codewords to the more probable symbols then the expected length might be 
reduced; on the other hand, shortening some codewords necessarily causes 
others to lengthen, by the Kraft inequality. 


> 5.3 What’s the most compression that we can hope for? 


We wish to minimize the expected length of a code,

$$ L(C,X) = \sum_i p_i l_i. \tag{5.13} $$


As you might have guessed, the entropy appears as the lower bound on the 
expected length of a code. 


Lower bound on expected length. The expected length L(C,X) of a 
uniquely decodeable code is bounded below by H(X). 

Proof. We define the implicit probabilities $q_i \equiv 2^{-l_i}/z$, where $z = \sum_{i'} 2^{-l_{i'}}$, so
that $l_i = \log 1/q_i - \log z$. We then use Gibbs’ inequality, $\sum_i p_i \log 1/q_i \ge \sum_i p_i \log 1/p_i$,
with equality if $q_i = p_i$, and the Kraft inequality $z \le 1$:

$$ L(C,X) = \sum_i p_i l_i = \sum_i p_i \log 1/q_i - \log z \tag{5.14} $$

$$ \ge \sum_i p_i \log 1/p_i - \log z \tag{5.15} $$

$$ \ge H(X). \tag{5.16} $$

The equality L(C, X) = H(X) is achieved only if the Kraft equality z = 1
is satisfied, and if the codelengths satisfy $l_i = \log(1/p_i)$. □





This is an important result so let’s say it again:

Optimal source codelengths. The expected length is minimized and is
equal to H(X) only if the codelengths are equal to the Shannon
information contents:

$$ l_i = \log_2(1/p_i). \tag{5.17} $$

Implicit probabilities defined by codelengths. Conversely, any choice
of codelengths $\{l_i\}$ implicitly defines a probability distribution $\{q_i\}$,

$$ q_i \equiv 2^{-l_i}/z, \tag{5.18} $$

for which those codelengths would be the optimal codelengths. If the
code is complete then z = 1 and the implicit probabilities are given by
$q_i = 2^{-l_i}$.


> 5.4 How much can we compress? 
So, we can’t compress below the entropy. How close can we expect to get to 
the entropy? 


Theorem 5.1 Source coding theorem for symbol codes. For an ensemble X 
there exists a prefix code C with expected length satisfying 


$$ H(X) \le L(C,X) < H(X) + 1. \tag{5.19} $$


Proof. We set the codelengths to integers slightly larger than the optimum
lengths:

$$ l_i = \lceil \log_2(1/p_i) \rceil, \tag{5.20} $$

where $\lceil l^* \rceil$ denotes the smallest integer greater than or equal to $l^*$. [We
are not asserting that the optimal code necessarily uses these lengths,
we are simply choosing these lengths because we can use them to prove
the theorem.]


We check that there is a prefix code with these lengths by confirming
that the Kraft inequality is satisfied.

$$ \sum_i 2^{-l_i} = \sum_i 2^{-\lceil \log_2(1/p_i) \rceil} \le \sum_i 2^{-\log_2(1/p_i)} = \sum_i p_i = 1. \tag{5.21} $$

Then we confirm

$$ L(C,X) = \sum_i p_i \lceil \log(1/p_i) \rceil < \sum_i p_i \left( \log(1/p_i) + 1 \right) = H(X) + 1. \tag{5.22} $$

□
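
The construction used in this proof is easy to try out. In the sketch below the probabilities are a hypothetical example of my own, and `shannon_lengths` is an assumed helper name; the code rounds each ideal length up, then checks the Kraft inequality (5.21) and the bounds of the theorem.

```python
import math

def shannon_lengths(probs):
    """Codelengths l_i = ceil(log2(1/p_i)), as in equation (5.20)."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]     # hypothetical ensemble, chosen only for illustration
lengths = shannon_lengths(probs)

kraft = sum(2.0 ** -l for l in lengths)
H = sum(p * math.log2(1 / p) for p in probs)
L = sum(p * l for p, l in zip(probs, lengths))

print(lengths)    # rounded-up codelengths
print(kraft)      # at most 1, so a prefix code with these lengths exists
print(H, L)       # H <= L < H + 1
```

As the proof itself notes, these lengths are chosen for convenience; an optimal code may do better.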


    x    P(x)
    a    0.0575
    b    0.0128
    c    0.0263
    d    0.0285
    e    0.0913
    f    0.0173
    g    0.0133
    h    0.0313
    i    0.0599
    j    0.0006
    k    0.0084
    l    0.0335
    m    0.0235
    n    0.0596
    o    0.0689
    p    0.0192
    q    0.0008
    r    0.0508
    s    0.0567
    t    0.0706
    u    0.0334
    v    0.0069
    w    0.0119
    x    0.0073
    y    0.0164
    z    0.0007
    -    0.1928














The cost of using the wrong codelengths 


If we use a code whose lengths are not equal to the optimal codelengths, the 
average message length will be larger than the entropy. 

If the true probabilities are $\{p_i\}$ and we use a complete code with lengths
$l_i$, we can view those lengths as defining implicit probabilities $q_i = 2^{-l_i}$.
Continuing from equation (5.14), the average length is

$$ L(C,X) = H(X) + \sum_i p_i \log p_i/q_i, \tag{5.23} $$

i.e., it exceeds the entropy by the relative entropy $D_{\mathrm{KL}}(p\|q)$ (as defined on
p.34).
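
Equation (5.23) can be verified on the ensemble of example 5.10 coded with the fixed-length code C4. This is a small sketch with hypothetical function names; the code C4 is complete, so its implicit probabilities are $q_i = 2^{-l_i} = 1/4$.

```python
import math

def entropy(p):
    """H(X) in bits."""
    return sum(pi * math.log2(1 / pi) for pi in p)

def relative_entropy(p, q):
    """D_KL(p||q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.125, 0.125]        # true probabilities (example 5.10)
lengths = [2, 2, 2, 2]               # codelengths of C4
q = [2.0 ** -l for l in lengths]     # implicit probabilities of the complete code

L = sum(pi * l for pi, l in zip(p, lengths))
print(L, entropy(p), relative_entropy(p, q))
```

The expected length of 2 bits exceeds the entropy of 1.75 bits by exactly the relative entropy, 0.25 bits.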




> 5.5 Optimal source coding with symbol codes: Huffman coding 


Given a set of probabilities $\mathcal{P}$, how can we design an optimal prefix code?
For example, what is the best symbol code for the English language ensemble
shown in figure 5.3? When we say ‘optimal’, let’s assume our aim is to
minimize the expected length L(C, X).

Figure 5.3. An ensemble in need of a symbol code.


How not to do it 


One might try to roughly split the set Ax in two, and continue bisecting the 
subsets so as to define a binary tree from the root. This construction has the 
right spirit, as in the weighing problem, but it is not necessarily optimal; it 
achieves L(C, X) < H(X) +2. 
The Huffman coding algorithm


We now present a beautifully simple algorithm for finding an optimal prefix 
code. The trick is to construct the code backwards starting from the tails of 
the codewords; we build the binary tree from its leaves. 


Algorithm 5.4. Huffman coding algorithm.

1. Take the two least probable symbols in the alphabet. These two
   symbols will be given the longest codewords, which will have equal
   length, and differ only in the last digit.

2. Combine these two symbols into a single symbol, and repeat.





Since each step reduces the size of the alphabet by one, this algorithm will 
have assigned strings to all the symbols after |Ax|— 1 steps. 


Example 5.15. Let $\mathcal{A}_X = \{ a, b, c, d, e \}$
and $\mathcal{P}_X = \{ 0.25, 0.25, 0.2, 0.15, 0.15 \}$.

    x    step 1     step 2     step 3     step 4
    a    0.25       0.25       0.25 (0)   0.55 (0)   1.0
    b    0.25       0.25 (0)   0.45       0.45 (1)
    c    0.2        0.2  (1)
    d    0.15 (0)   0.3        0.3  (1)
    e    0.15 (1)

The digit in parentheses is the binary digit assigned to a symbol (or
merged group of symbols) when it is combined at that step.

The codewords are then obtained by concatenating the binary digits in
reverse order: C = {00, 10, 11, 010, 011}. The codelengths selected
by the Huffman algorithm (column 4 of table 5.5) are in some cases
longer and in some cases shorter than the ideal codelengths, the Shannon
information contents $\log_2 1/p_i$ (column 3). The expected length of the
code is L = 2.30 bits, whereas the entropy is H = 2.2855 bits.

Table 5.5. Code created by the Huffman algorithm.

    a_i   p_i    h(p_i)   l_i   c(a_i)
    a     0.25   2.0      2     00
    b     0.25   2.0      2     10
    c     0.2    2.3      2     11
    d     0.15   2.7      3     010
    e     0.15   2.7      3     011














If at any point there is more than one way of selecting the two least probable 
symbols then the choice may be made in any manner — the expected length of 
the code will not depend on the choice. 
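
Algorithm 5.4 translates almost line for line into a program that keeps the current symbols in a priority queue. The following Python sketch is one possible rendering (the function name `huffman_code` and the use of a counter for tie-breaking are my own choices). Because ties are broken differently from the worked example, the individual codewords may differ from table 5.5, but the codelengths and the expected length agree.

```python
import heapq
import math
from itertools import count

def huffman_code(probs):
    """Return {symbol: codeword}, built by repeatedly merging the two least probable nodes."""
    tiebreak = count()   # unique counter so that equal probabilities never compare dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # least probable node
        p1, _, group1 = heapq.heappop(heap)   # second least probable node
        # Prepend the distinguishing digit: codewords are built from their tails.
        merged = {s: "0" + w for s, w in group0.items()}
        merged.update({s: "1" + w for s, w in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"a": 0.25, "b": 0.25, "c": 0.2, "d": 0.15, "e": 0.15}   # example 5.15
code = huffman_code(probs)
L = sum(p * len(code[s]) for s, p in probs.items())
H = sum(p * math.log2(1 / p) for p in probs.values())
print(code)
print(L, H)
```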


Exercise 5.16 (p.105). Prove that there is no better symbol code for a source
than the Huffman code.


Example 5.17. We can make a Huffman code for the probability distribution 
over the alphabet introduced in figure 2.1. The result is shown in fig- 
ure 5.6. This code has an expected length of 4.15 bits; the entropy of 
the ensemble is 4.11 bits. Observe the disparities between the assigned 
codelengths and the ideal codelengths $\log_2 1/p_i$.


Constructing a binary tree top-down is suboptimal 


In previous chapters we studied weighing problems in which we built ternary 
or binary trees. We noticed that balanced trees — ones in which, at every step, 
the two possible outcomes were as close as possible to equiprobable — appeared 
to describe the most efficient experiments. This gave an intuitive motivation 
for entropy as a measure of information content. 
    a_i   p_i      log_2(1/p_i)   l_i   c(a_i)
    a     0.0575    4.1            4    0000
    b     0.0128    6.3            6    001000
    c     0.0263    5.2            5    00101
    d     0.0285    5.1            5    10000
    e     0.0913    3.5            4    1100
    f     0.0173    5.9            6    111000
    g     0.0133    6.2            6    001001
    h     0.0313    5.0            5    10001
    i     0.0599    4.1            4    1001
    j     0.0006   10.7           10    1101000000
    k     0.0084    6.9            7    1010000
    l     0.0335    4.9            5    11101
    m     0.0235    5.4            6    110101
    n     0.0596    4.1            4    0001
    o     0.0689    3.9            4    1011
    p     0.0192    5.7            6    111001
    q     0.0008   10.3            9    110100001
    r     0.0508    4.3            5    11011
    s     0.0567    4.1            4    0011
    t     0.0706    3.8            4    1111
    u     0.0334    4.9            5    10101
    v     0.0069    7.2            8    11010001
    w     0.0119    6.4            7    1101001
    x     0.0073    7.1            7    1010001
    y     0.0164    5.9            6    101001
    z     0.0007   10.4           10    1101000001
    -     0.1928    2.4            2    01


It is not the case, however, that optimal codes can always be constructed 
by a greedy top-down method in which the alphabet is successively divided 
into subsets that are as near as possible to equiprobable. 


Example 5.18. Find the optimal binary symbol code for the ensemble:

$$ \mathcal{A}_X = \{ a, b, c, d, e, f, g \}, \quad \mathcal{P}_X = \{ 0.01, 0.24, 0.05, 0.20, 0.47, 0.01, 0.02 \}. \tag{5.24} $$

Notice that a greedy top-down method can split this set into two
subsets {a,b,c,d} and {e,f,g} which both have probability 1/2, and that
{a,b,c,d} can be divided into subsets {a,b} and {c,d}, which have
probability 1/4; so a greedy top-down method gives the code shown in the
third column of table 5.7, which has expected length 2.53. The Huffman
coding algorithm yields the code shown in the fourth column, which has
expected length 1.97.
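
The two expected lengths quoted in example 5.18 follow from equation (5.6) applied to the codewords of table 5.7. The sketch below simply transcribes that table into Python dictionaries (the variable and function names are my own) and recomputes both numbers.

```python
probs = {"a": 0.01, "b": 0.24, "c": 0.05, "d": 0.20, "e": 0.47, "f": 0.01, "g": 0.02}

# Codewords transcribed from table 5.7.
greedy = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "10", "f": "110", "g": "111"}
huffman = {"a": "000000", "b": "01", "c": "0001", "d": "001", "e": "1",
           "f": "000001", "g": "00001"}

def expected_length(code, probs):
    """L(C, X) = sum_i p_i l_i."""
    return sum(p * len(code[s]) for s, p in probs.items())

L_greedy = expected_length(greedy, probs)
L_huffman = expected_length(huffman, probs)
print(L_greedy, L_huffman)
```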














Figure 5.6. Huffman code for the English language ensemble
(monogram statistics).


Table 5.7. A greedily-constructed code compared with the Huffman code.

    a_i   p_i    Greedy   Huffman
    a     .01    000      000000
    b     .24    001      01
    c     .05    010      0001
    d     .20    011      001
    e     .47    10       1
    f     .01    110      000001
    g     .02    111      00001


> 5.6 Disadvantages of the Huffman code


The Huffman algorithm produces an optimal symbol code for an ensemble,
but this is not the end of the story. Both the word ‘ensemble’ and the phrase
‘symbol code’ need careful attention.


Changing ensemble


If we wish to communicate a sequence of outcomes from one unchanging
ensemble, then a Huffman code may be convenient. But often the appropriate
ensemble changes. If for example we are compressing text, then the symbol
frequencies will vary with context: in English the letter u is much more
probable after a q than after an e (figure 2.3). And furthermore, our knowledge of
these context-dependent symbol frequencies will also change as we learn the
statistical properties of the text source.


Huffman codes do not handle changing ensemble probabilities with any 
elegance. One brute-force approach would be to recompute the Huffman code 
every time the probability over symbols changes. Another attitude is to deny 
the option of adaptation, and instead run through the entire file in advance 
and compute a good probability distribution, which will then remain fixed 
throughout transmission. The code itself must also be communicated in this 
scenario. Such a technique is not only cumbersome and restrictive, it is also 
suboptimal, since the initial message specifying the code and the document 
itself are partially redundant. This technique therefore wastes bits. 


The extra bit 


An equally serious problem with Huffman codes is the innocuous-looking ‘ex- 
tra bit’ relative to the ideal average length of H(X) — a Huffman code achieves 
a length that satisfies $H(X) \le L(C,X) < H(X)+1$, as proved in theorem 5.1.
A Huffman code thus incurs an overhead of between 0 and 1 bits per symbol. 
If H(X) were large, then this overhead would be an unimportant fractional 
increase. But for many applications, the entropy may be as low as one bit 
per symbol, or even smaller, so the overhead L(C, X) — H(X) may domi- 
nate the encoded file length. Consider English text: in some contexts, long 
strings of characters may be highly predictable. For example, in the context 
‘strings_of_ch’, one might predict the next nine symbols to be ‘aracters_’ 
with a probability of 0.99 each. A traditional Huffman code would be obliged 
to use at least one bit per character, making a total cost of nine bits where 
virtually no information is being conveyed (0.13 bits in total, to be precise). 
The entropy of English, given a good model, is about one bit per character 
(Shannon, 1948), so a Huffman code is likely to be highly inefficient. 
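
The figure of 0.13 bits quoted above is a one-line calculation, shown here as a small Python sketch (the variable names are my own): nine symbols, each predicted with probability 0.99, carry $9 \log_2(1/0.99)$ bits in total, whereas a symbol code spends at least one bit on each.

```python
import math

p = 0.99          # predicted probability of each of the next symbols
n_symbols = 9     # 'aracters_' has nine symbols

information = n_symbols * math.log2(1 / p)   # total Shannon information content
symbol_code_cost = n_symbols * 1             # at least one bit per symbol

print(round(information, 2), symbol_code_cost)
```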


A traditional patch-up of Huffman codes uses them to compress blocks of
symbols, for example the ‘extended sources’ $X^N$ we discussed in Chapter 4.
The overhead per block is at most 1 bit so the overhead per symbol is at most
1/N bits. For sufficiently large blocks, the problem of the extra bit may be
removed, but only at the expense of (a) losing the elegant instantaneous
decodeability of simple Huffman coding; and (b) having to compute the
probabilities of all relevant strings and build the associated Huffman tree. One will
end up explicitly computing the probabilities and codes for a huge number of
strings, most of which will never actually occur. (See exercise 5.29 (p.103).)
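
The trade-off can be seen numerically on a hypothetical binary source with probabilities {0.99, 0.01}, echoing the highly predictable context above. The sketch below is illustrative (the function names are assumptions). It uses the fact that the expected length of a Huffman code equals the sum of the probabilities of all the merged nodes, so no codewords need to be built; note that the number of block probabilities it must enumerate is $2^N$.

```python
import heapq
import math
from itertools import product

def huffman_expected_length(probs):
    """Expected Huffman codelength: the sum of the probabilities of all merged nodes."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged        # each merge adds one bit to every symbol beneath it
        heapq.heappush(heap, merged)
    return total

def block_probs(p, N):
    """Probabilities of all strings of N independent symbols from the distribution p."""
    return [math.prod(block) for block in product(p, repeat=N)]

p = [0.99, 0.01]               # hypothetical, highly predictable binary source
H = sum(pi * math.log2(1 / pi) for pi in p)
print("entropy per symbol:", H)
for N in range(1, 5):
    L = huffman_expected_length(block_probs(p, N))
    print(N, 2 ** N, L / N)    # block length, number of strings, bits per symbol
```

The bits per symbol fall towards the entropy as N grows, while the table of strings doubles in size with every extra symbol in the block.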


Beyond symbol codes 


Huffman codes, therefore, although widely trumpeted as ‘optimal’, have many 
defects for practical purposes. They are optimal symbol codes, but for practi- 
cal purposes we don’t want a symbol code. 


The defects of Huffman codes are rectified by arithmetic coding, which 
dispenses with the restriction that each symbol must translate into an integer 
number of bits. Arithmetic coding is the main topic of the next chapter. 
> 5.7 Summary


Kraft inequality. If a code is uniquely decodeable its lengths must satisfy

$$ \sum_i 2^{-l_i} \le 1. \tag{5.25} $$

For any lengths satisfying the Kraft inequality, there exists a prefix code
with those lengths.


Optimal source codelengths for an ensemble are equal to the Shannon
information contents

$$ l_i = \log_2 \frac{1}{p_i}, \tag{5.26} $$

and conversely, any choice of codelengths defines implicit probabilities

$$ q_i = \frac{2^{-l_i}}{z}. \tag{5.27} $$


The relative entropy $D_{\mathrm{KL}}(p\|q)$ measures how many bits per symbol are
wasted by using a code whose implicit probabilities are q, when the
ensemble’s true probability distribution is p.


Source coding theorem for symbol codes. For an ensemble X, there
exists a prefix code whose expected length satisfies

$$ H(X) \le L(C,X) < H(X) + 1. \tag{5.28} $$


The Huffman coding algorithm generates an optimal symbol code itera- 
tively. At each iteration, the two least probable symbols are combined. 


> 5.8 Exercises


> Exercise 5.19. Is the code {00, 11, 0101, 111, 1010, 100100, 0110} uniquely
decodeable?


> Exercise 5.20. Is the ternary code {00, 012, 0110, 0112, 100, 201, 212, 22}
uniquely decodeable?





> Exercise 5.21 (p.106). Make Huffman codes for $X^2$, $X^3$ and $X^4$ where
$\mathcal{A}_X = \{0,1\}$ and $\mathcal{P}_X = \{0.9, 0.1\}$. Compute their expected lengths and
compare them with the entropies $H(X^2)$, $H(X^3)$ and $H(X^4)$.


Repeat this exercise for $X^2$ and $X^4$ where $\mathcal{P}_X = \{0.6, 0.4\}$.


> Exercise 5.22 (p.106). Find a probability distribution $\{p_1, p_2, p_3, p_4\}$ such that
there are two optimal codes that assign different lengths $\{l_i\}$ to the four
symbols.


Exercise 5.23.1] (Continuation of exercise 5.22.) Assume that the four proba- 
bilities {p1, p2, p3, pa} are ordered such that pı > po > p3 > p4 > 0. Let 
Q be the set of all probability vectors p such that there are two optimal 
codes with different lengths. Give a complete description of Q. Find 
three probability vectors q®, q), q6), which are the convex hull of Q, 
i.e., such that any p € Q can be written as 


( (2) 





p = ma + poq? + pq, (5.29) 


where {j1;} are positive. 
> Exercise 5.24.[3] Write a short essay discussing how to play the game of twenty
questions optimally. [In twenty questions, one player thinks of an object,
and the other player has to guess the object using as few binary questions
as possible, preferably fewer than twenty.]

> Exercise 5.25.[2] Show that, if each probability $p_i$ is equal to an integer power
of 2 then there exists a source code whose expected length equals the
entropy.


> Exercise 5.26.[2, p.106] Make ensembles for which the difference between the
entropy and the expected length of the Huffman code is as big as possible.


> Exercise 5.27.[2, p.106] A source X has an alphabet of eleven characters

    {a, b, c, d, e, f, g, h, i, j, k},

all of which have equal probability, 1/11.


Find an optimal uniquely decodeable symbol code for this source. How 
much greater is the expected length of this optimal code than the entropy 
of X? 


> Exercise 5.28.[2] Consider the optimal symbol code for an ensemble X with
alphabet size I from which all symbols have identical probability p =
1/I. I is not a power of 2.

Show that the fraction $f^+$ of the I symbols that are assigned codelengths
equal to

    $l^+ \equiv \lceil \log_2 I \rceil$    (5.30)

satisfies

    $f^+ = 2 - \frac{2^{l^+}}{I}$    (5.31)

and that the expected length of the optimal symbol code is

    $L = l^+ - 1 + f^+$.    (5.32)

By differentiating the excess length $\Delta L \equiv L - H(X)$ with respect to I,
show that the excess length is bounded by

    $\Delta L \le 1 - \frac{\ln(\ln 2)}{\ln 2} - \frac{1}{\ln 2} = 0.086$.    (5.33)


> Exercise 5.29.[2] Consider a sparse binary source with $P_X$ = {0.99, 0.01}. Dis-
cuss how Huffman codes could be used to compress this source efficiently.
Estimate how many codewords your proposed solutions require.


> Exercise 5.30.[2] Scientific American carried the following puzzle in 1975.


The poisoned glass. ‘Mathematicians are curious birds’, the police 
commissioner said to his wife. ‘You see, we had all those partly 
filled glasses lined up in rows on a table in the hotel kitchen. Only 
one contained poison, and we wanted to know which one before 
searching that glass for fingerprints. Our lab could test the liquid 
in each glass, but the tests take time and money, so we wanted to 
make as few of them as possible by simultaneously testing mixtures 
of small samples from groups of glasses. The university sent over a 
mathematics professor to help us. He counted the glasses, smiled
and said:

‘“Pick any glass you want, Commissioner. We’ll test it first.”

‘“But won’t that waste a test?” I asked.

‘“No,” he said, “it’s part of the best procedure. We can test one
glass first. It doesn’t matter which one.”’

‘How many glasses were there to start with?’ the commissioner’s
wife asked.

‘I don’t remember. Somewhere between 100 and 200.’

What was the exact number of glasses? 


Solve this puzzle and then explain why the professor was in fact wrong 
and the commissioner was right. What is in fact the optimal procedure 
for identifying the one poisoned glass? What is the expected waste 
relative to this optimum if one followed the professor’s strategy? Explain 
the relationship to symbol coding. 


> Exercise 5.31.[2, p.106] Assume that a sequence of symbols from the ensemble
X introduced at the beginning of this chapter is compressed using the
code C3. Imagine picking one bit at random from the binary encoded
sequence $c = c(x_1)c(x_2)c(x_3)\ldots$. What is the probability that this bit
is a 1?


C3:

    a_i   c(a_i)   p_i   h(p_i)   l_i
    a     0        1/2   1.0      1
    b     10       1/4   2.0      2
    c     110      1/8   3.0      3
    d     111      1/8   3.0      3


> Exercise 5.32.[2, p.107] How should the binary Huffman encoding scheme be
modified to make optimal symbol codes in an encoding alphabet with q
symbols? (Also known as ‘radix q’.)




Mixture codes 


It is a tempting idea to construct a ‘metacode’ from several symbol codes that 
assign different-length codewords to the alternative symbols, then switch from 
one code to another, choosing whichever assigns the shortest codeword to the 
current symbol. Clearly we cannot do this for free. If one wishes to choose 
between two codes, then it is necessary to lengthen the message in a way that 
indicates which of the two codes is being used. If we indicate this choice by 
a single leading bit, it will be found that the resulting code is suboptimal 
because it is incomplete (that is, it fails the Kraft equality). 


> Exercise 5.33.[2, p.108] Prove that this metacode is incomplete, and explain
why this combined code is suboptimal.


> 5.9 Solutions 


Solution to exercise 5.8 (p.93). Yes, C2 = {1,101} is uniquely decodeable, 
even though it is not a prefix code, because no two different strings can map 
onto the same string; only the codeword c(a2) = 101 contains the symbol 0. 


Solution to exercise 5.14 (p.95). We wish to prove that for any set of codeword 
lengths {l;} satisfying the Kraft inequality, there is a prefix code having those 
lengths. This is readily proved by thinking of the codewords illustrated in 
figure 5.8 as being in a ‘codeword supermarket’, with size indicating cost. 
We imagine purchasing codewords one at a time, starting from the shortest 
codewords (i.e., the biggest purchases), using the budget shown at the right 
of figure 5.8. We start at one side of the codeword supermarket, say the 
top, and purchase the first codeword of the required length. We advance
down the supermarket a distance $2^{-l}$, and purchase the next codeword of the
next required length, and so forth. Because the codeword lengths are getting
longer, and the corresponding intervals are getting shorter, we can always
buy an adjacent codeword to the latest purchase, so there is no wasting of
the budget. Thus at the Ith codeword we have advanced a distance $\sum_{i=1}^{I} 2^{-l_i}$
down the supermarket; if $\sum_i 2^{-l_i} \le 1$, we will have purchased all the codewords
without running out of budget.

Figure 5.8. The codeword supermarket and the symbol coding budget. The
‘cost’ $2^{-l}$ of each codeword (with length l) is indicated by the size of the
box it is written in. The total budget available when making a uniquely
decodeable code is 1.

Figure 5.9. Proof that Huffman coding makes an optimal symbol code. We
assume that the rival code, which is said to be optimal, assigns unequal
length codewords to the two symbols with smallest probability, a and b. By
interchanging codewords a and c of the rival code, where c is a symbol with
rival codelength as long as b’s, we can make a code better than the rival
code. This shows that the rival code was not optimal.

    symbol   probability   Huffman     Rival code’s   Modified rival
                           codewords   codewords      code
    a        p_a           c_H(a)      c_R(a)         c_R(c)
    b        p_b           c_H(b)      c_R(b)         c_R(b)
    c        p_c           c_H(c)      c_R(c)         c_R(a)
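
The purchasing procedure can be sketched as follows. This is an illustrative implementation under the assumption that all lengths are positive integers; the function name is hypothetical. The variable `position` plays the role of the distance advanced down the supermarket, measured in units of 2^(-l_max).

```python
def prefix_code_from_lengths(lengths):
    """Assign binary codewords to the given lengths, buying the shortest first.

    Requires the Kraft inequality sum_i 2^{-l_i} <= 1 and lengths >= 1.
    """
    max_len = max(lengths)
    # Kraft sum in integer units of 2^{-max_len}.
    if sum(1 << (max_len - l) for l in lengths) > (1 << max_len):
        raise ValueError("lengths violate the Kraft inequality")
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    position = 0  # budget spent so far, in units of 2^{-max_len}
    for i in order:
        l = lengths[i]
        # Lengths are non-decreasing, so position is a multiple of 2^{max_len - l}
        # and its leading l bits name an unused codeword.
        codewords[i] = format(position >> (max_len - l), "0{}b".format(l))
        position += 1 << (max_len - l)
    return codewords

print(prefix_code_from_lengths([1, 2, 3, 3]))
```

For the lengths {1, 2, 3, 3} this prints the codewords 0, 10, 110, 111, which is the code C3 used in exercise 5.31.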


Solution to exercise 5.16 (p.99). The proof that Huffman coding is optimal 
depends on proving that the key step in the algorithm — the decision to give 
the two symbols with smallest probability equal encoded lengths — cannot 
lead to a larger expected length than any other code. We can prove this by 
contradiction. 

Assume that the two symbols with smallest probability, called a and b, 
to which the Huffman algorithm would assign equal length codewords, do not 
have equal lengths in any optimal symbol code. The optimal symbol code 
is some other rival code in which these two codewords have unequal lengths 
$l_a$ and $l_b$, with $l_a < l_b$. Without loss of generality we can assume that this
other code is a complete prefix code, because any codelengths of a uniquely 
decodeable code can be realized by a prefix code. 

In this rival code, there must be some other symbol c whose probability
$p_c$ is greater than $p_a$ and whose length in the rival code is greater than or
equal to $l_b$, because the code for b must have an adjacent codeword of equal
or greater length — a complete prefix code never has a solo codeword of the 
maximum length. 

Consider exchanging the codewords of a and c (figure 5.9), so that a is 
encoded with the longer codeword that was c’s, and c, which is more probable
than a, gets the shorter codeword. Clearly this reduces the expected length
of the code. The change in expected length is $(p_a - p_c)(l_c - l_a)$. Thus we have
contradicted the assumption that the rival code is optimal. Therefore it is 
valid to give the two symbols with smallest probability equal encoded lengths. 








Huffman coding produces optimal symbol codes. □

Solution to exercise 5.21 (p.102). A Huffman code for $X^2$ where $A_X$ = {0, 1}
and $P_X$ = {0.9, 0.1} is {00, 01, 10, 11} → {1, 01, 000, 001}. This code has
$L(C, X^2)$ = 1.29, whereas the entropy $H(X^2)$ is 0.938.

A Huffman code for $X^3$ is

    {000, 100, 010, 001, 101, 011, 110, 111} →
    {1, 011, 010, 001, 00000, 00001, 00010, 00011}.

This has expected length $L(C, X^3)$ = 1.598 whereas the entropy $H(X^3)$ is
1.4069.

A Huffman code for $X^4$ maps the sixteen source strings to the following
codelengths:

    {0000, 1000, 0100, 0010, 0001, 1100, 0110, 0011, 0101, 1010, 1001, 1110, 1101,
    1011, 0111, 1111} → {1, 3, 3, 3, 4, 6, 7, 7, 7, 7, 7, 9, 9, 9, 10, 10}.

This has expected length $L(C, X^4)$ = 1.9702 whereas the entropy $H(X^4)$ is
1.876.

When $P_X$ = {0.6, 0.4}, the Huffman code for $X^2$ has lengths {2, 2, 2, 2};
the expected length is 2 bits, and the entropy is 1.94 bits. A Huffman code for
$X^4$ is shown in table 5.10. The expected length is 3.92 bits, and the entropy
is 3.88 bits.

    a_i    p_i      l_i   c(a_i)
    0000   0.1296   3     000
    0001   0.0864   4     0100
    0010   0.0864   4     0110
    0100   0.0864   4     0111
    1000   0.0864   3     100
    1100   0.0576   4     1010
    1010   0.0576   4     1100
    1001   0.0576   4     1101
    0110   0.0576   4     1110
    0101   0.0576   4     1111
    0011   0.0576   4     0010
    1110   0.0384   5     00110
    1101   0.0384   5     01010
    1011   0.0384   5     01011
    0111   0.0384   4     1011
    1111   0.0256   5     00111

Table 5.10. Huffman code for $X^4$ when $p_0$ = 0.6. Column 3 shows the
assigned codelengths and column 4 the codewords. Some strings whose
probabilities are identical, e.g., the fourth and fifth, receive different
codelengths.


Solution to exercise 5.22 (p.102). The set of probabilities {$p_1, p_2, p_3, p_4$} =
{1/6, 1/6, 1/3, 1/3} gives rise to two different optimal sets of codelengths, because
at the second step of the Huffman coding algorithm we can choose any of the
three possible pairings. We may either put them in a constant length code
{00, 01, 10, 11} or the code {000, 001, 01, 1}. Both codes have expected length
2.

Another solution is {$p_1, p_2, p_3, p_4$} = {1/5, 1/5, 1/5, 2/5}.

And a third is {$p_1, p_2, p_3, p_4$} = {1/3, 1/3, 1/3, 0}.


Solution to exercise 5.26 (p.103). Let $p_{max}$ be the largest probability in
$p_1, p_2, \ldots, p_I$. The difference between the expected length L and the entropy
H can be no bigger than max($p_{max}$, 0.086) (Gallager, 1978).

See exercises 5.27—5.28 to understand where the curious 0.086 comes from. 


Solution to exercise 5.27 (p.103). Length — entropy = 0.086. 


Solution to exercise 5.31 (p.104). There are two ways to answer this problem 
correctly, and one popular way to answer it incorrectly. Let’s give the incorrect 
answer first: 


Erroneous answer. “We can pick a random bit by first picking a random
source symbol $x_i$ with probability $p_i$, then picking a random bit from
$c(x_i)$. If we define $f_i$ to be the fraction of the bits of $c(x_i)$ that are 1s,
we find

    P(bit is 1) = $\sum_i p_i f_i$    (5.34)

                = 1/2 × 0 + 1/4 × 1/2 + 1/8 × 2/3 + 1/8 × 1 = 1/3.”    (5.35)

C3:

    a_i   c(a_i)   p_i   l_i
    a     0        1/2   1
    b     10       1/4   2
    c     110      1/8   3
    d     111      1/8   3

This answer is wrong because it falls for the bus-stop fallacy, which was
introduced in exercise 2.35 (p.38): if buses arrive at random, and we are interested
in ‘the average time from one bus until the next’, we must distinguish two 
possible averages: (a) the average time from a randomly chosen bus until the 
next; (b) the average time between the bus you just missed and the next bus. 
The second ‘average’ is twice as big as the first because, by waiting for a bus 
at a random time, you bias your selection of a bus in favour of buses that 
follow a large gap. You’re unlikely to catch a bus that comes 10 seconds after 
a preceding bus! Similarly, the symbols c and d get encoded into longer-length
binary strings than a, so when we pick a bit from the compressed string at
random, we are more likely to land in a bit belonging to a c or a d than would
be given by the probabilities $p_i$ in the expectation (5.34). All the probabilities
need to be scaled up by $l_i$, and renormalized.


Correct answer in the same style. Every time symbol $x_i$ is encoded, $l_i$
bits are added to the binary string, of which $f_i l_i$ are 1s. The expected
number of 1s added per symbol is

    $\sum_i p_i f_i l_i$;    (5.36)

and the expected total number of bits added per symbol is

    $\sum_i p_i l_i$.    (5.37)

So the fraction of 1s in the transmitted string is

    P(bit is 1) = $\frac{\sum_i p_i f_i l_i}{\sum_i p_i l_i}$    (5.38)

                = $\frac{1/2 \times 0 + 1/4 \times 1 + 1/8 \times 2 + 1/8 \times 3}{7/4} = \frac{7/8}{7/4} = 1/2$.



For a general symbol code and a general ensemble, the expectation (5.38) is 
the correct answer. But in this case, we can use a more powerful argument. 
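
The two calculations can be checked with exact arithmetic. The sketch below uses the code C3 and the probabilities from the table; the variable names are illustrative.

```python
from fractions import Fraction as F

# Code C3: symbol -> (codeword, probability)
code = {
    "a": ("0", F(1, 2)),
    "b": ("10", F(1, 4)),
    "c": ("110", F(1, 8)),
    "d": ("111", F(1, 8)),
}

# Erroneous answer (5.34): average the fraction of 1s per codeword over symbols.
wrong = sum(p * F(w.count("1"), len(w)) for w, p in code.values())

# Correct answer (5.38): expected number of 1s per symbol divided by the
# expected number of bits per symbol.
ones_per_symbol = sum(p * w.count("1") for w, p in code.values())
bits_per_symbol = sum(p * len(w) for w, p in code.values())
right = ones_per_symbol / bits_per_symbol

print(wrong, right)
```

The erroneous calculation gives 1/3, while weighting each symbol by its codeword length gives 1/2.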


Information-theoretic answer. The encoded string c is the output of an 
optimal compressor that compresses samples from X down to an ex- 
pected length of H(X) bits. We can’t expect to compress this data any 
further. But if the probability P(bit is 1) were not equal to 1/2 then it 
would be possible to compress the binary string further (using a block 
compression code, say). Therefore P(bit is 1) must be equal to 1/2;
indeed the probability of any sequence of l bits in the compressed stream
taking on any particular value must be $2^{-l}$. The output of a perfect
compressor is always perfectly random bits. 


To put it another way, if the probability P(bit is 1) were not equal to 
1/2, then the information content per bit of the compressed string would 
be at most $H_2(P(1))$, which would be less than 1; but this contradicts
the fact that we can recover the original data from c, so the information 
content per bit of the compressed string must be H(X)/L(C, X) =1. 


Solution to exercise 5.32 (p.104). The general Huffman coding algorithm for 
an encoding alphabet with q symbols has one difference from the binary case. 
The process of combining q symbols into 1 symbol reduces the number of 
symbols by q−1. So if we start with A symbols, we'll only end up with a
complete q-ary tree if A mod (q−1) is equal to 1. Otherwise, we know that
whatever prefix code we make, it must be an incomplete tree with a number
of missing leaves equal, modulo (q−1), to A mod (q−1) − 1. For example, if
a ternary tree is built for eight symbols, then there will unavoidably be one 
missing leaf in the tree. 

The optimal q-ary code is made by putting these extra leaves in the longest
branch of the tree. This can be achieved by adding the appropriate number 
of symbols to the original source symbol set, all of these extra symbols having 
probability zero. The total number of leaves is then equal to r(q—1) + 1, for 
some integer r. The symbols are then repeatedly combined by taking the q 
symbols with smallest probability and replacing them by a single symbol, as 
in the binary Huffman coding algorithm. 
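
A minimal sketch of this modification follows, returning codelengths only. The function names are hypothetical, and the eight equiprobable symbols are an illustrative choice matching the ternary example above.

```python
import heapq
from itertools import count

def num_dummy_symbols(n, q):
    """Zero-probability symbols needed so that (n + dummies) mod (q-1) == 1."""
    return (1 - n) % (q - 1)

def qary_huffman_lengths(probs, q):
    """Codelengths of a q-ary Huffman code, padding with zero-probability symbols."""
    n = len(probs)
    tiebreak = count()
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(probs)]
    # Dummy symbols carry no real symbol indices.
    heap += [(0.0, next(tiebreak), []) for _ in range(num_dummy_symbols(n, q))]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        total, members = 0.0, []
        for _ in range(q):  # combine the q least probable nodes
            p, _tb, syms = heapq.heappop(heap)
            total += p
            members += syms
        for i in members:
            lengths[i] += 1
        heapq.heappush(heap, (total, next(tiebreak), members))
    return lengths

lengths = qary_huffman_lengths([1 / 8] * 8, q=3)
kraft = sum(3.0 ** -l for l in lengths)
print(num_dummy_symbols(8, 3), lengths, kraft)
```

With eight symbols and q = 3, one dummy symbol is added, all eight real symbols receive length 2, and the Kraft sum is 8/9: one of the nine leaves at depth two is unavoidably missing.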


Solution to exercise 5.33 (p.104). We wish to show that a greedy metacode,
which picks the code which gives the shortest encoding, is actually suboptimal,
because it is incomplete: its Kraft sum is strictly less than 1.

We'll assume that each symbol x is assigned lengths $l_k(x)$ by each of the
candidate codes $C_k$. Let us assume there are K alternative codes and that we
can encode which code is being used with a header of length log K bits. Then
the metacode assigns lengths $l'(x)$ that are given by

    $l'(x) = \log_2 K + \min_k l_k(x)$.    (5.39)

We compute the Kraft sum:

    $S = \sum_x 2^{-l'(x)} = \frac{1}{K} \sum_x 2^{-\min_k l_k(x)}$.    (5.40)

Let’s divide the set $A_X$ into non-overlapping subsets $\{A_k\}_{k=1}^{K}$ such that subset
$A_k$ contains all the symbols x that the metacode sends via code k. Then

    $S = \frac{1}{K} \sum_k \sum_{x \in A_k} 2^{-l_k(x)}$.    (5.41)

Now if one sub-code k satisfies the Kraft equality $\sum_{x \in A_X} 2^{-l_k(x)} = 1$, then it
must be the case that

    $\sum_{x \in A_k} 2^{-l_k(x)} \le 1$,    (5.42)

with equality only if all the symbols x are in $A_k$, which would mean that we
are only using one of the K codes. So

    $S \le \frac{1}{K} \sum_{k=1}^{K} 1 = 1$,    (5.43)

with equality only if equation (5.42) is an equality for all codes k. But it’s
impossible for all the symbols to be in all the non-overlapping subsets $\{A_k\}_{k=1}^{K}$,
so we can’t have equality (5.42) holding for all k. So S < 1.

Another way of seeing that a mixture code is suboptimal is to consider 
the binary tree that it defines. Think of the special case of two codes. The 
first bit we send identifies which code we are using. Now, in a complete code, 
any subsequent binary string is a valid string. But once we know that we 
are using, say, code A, we know that what follows can only be a codeword 
corresponding to a symbol x whose encoding is shorter under code A than 
code B. So some strings are invalid continuations, and the mixture code is 
incomplete and suboptimal. 
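
A small numerical check of this argument: the sketch below builds a metacode from two complete codes over four symbols and evaluates its Kraft sum. The two sets of codelengths are illustrative assumptions, as are the function names, and the header length is taken to be exactly log2 K with K a power of two.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact Kraft sum of a set of binary codelengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def metacode_lengths(codes):
    """l'(x) = log2 K + min_k l_k(x), for K (a power of two) candidate codes."""
    K = len(codes)
    header = K.bit_length() - 1
    if 2 ** header != K:
        raise ValueError("this sketch assumes K is a power of two")
    return [header + min(ls) for ls in zip(*codes)]

code_A = [1, 2, 3, 3]   # complete: Kraft sum 1
code_B = [3, 3, 2, 1]   # complete: Kraft sum 1
meta = metacode_lengths([code_A, code_B])
S = kraft_sum(meta)
print(meta, S)
```

Although both sub-codes are complete, the metacode has lengths {2, 3, 3, 2} and Kraft sum 3/4, so a quarter of the codeword budget is wasted.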

For further discussion of this issue and its relationship to probabilistic 
modelling read about ‘bits back coding’ in section 28.3 and in Frey (1998). 
About Chapter 6


Before reading Chapter 6, you should have read the previous chapter and 
worked on most of the exercises in it. 

We’ll also make use of some Bayesian modelling ideas that arrived in the 
vicinity of exercise 2.8 (p.30). 
Stream Codes


In this chapter we discuss two data compression schemes. 

Arithmetic coding is a beautiful method that goes hand in hand with the 
philosophy that compression of data from a source entails probabilistic mod- 
elling of that source. As of 1999, the best compression methods for text files 
use arithmetic coding, and several state-of-the-art image compression systems 
use it too. 

Lempel-Ziv coding is a ‘universal’ method, designed under the philosophy 
that we would like a single compression algorithm that will do a reasonable job 
for any source. In fact, for many real life sources, this algorithm’s universal 
properties hold only in the limit of unfeasibly large amounts of data, but, all 
the same, Lempel—Ziv compression is widely used and often effective. 


> 6.1 The guessing game 


As a motivation for these two compression methods, consider the redundancy 
in a typical English text file. Such files have redundancy at several levels: for 
example, they contain the ASCII characters with non-equal frequency; certain 
consecutive pairs of letters are more probable than others; and entire words 
can be predicted given the context and a semantic understanding of the text. 

To illustrate the redundancy of English, and a curious way in which it 
could be compressed, we can imagine a guessing game in which an English 
speaker repeatedly attempts to predict the next character in a text file. 

For simplicity, let us assume that the allowed alphabet consists of the 26 
upper case letters A,B,C,..., Z and a space ‘-’. The game involves asking 
the subject to guess the next character repeatedly, the only feedback being 
whether the guess is correct or not, until the character is correctly guessed. 
After a correct guess, we note the number of guesses that were made when 
the character was identified, and ask the subject to guess the next character 
in the same way. 

One sentence gave the following result when a human was asked to guess
it. The numbers of guesses are listed below each character.


    T  H  E  R  E  -  I  S  -  N  O  -  R  E  V  E  R  S  E  -  O  N  -  A  -  M  O  T  O  R  C  Y  C  L  E  -
    1  1  1  5  1  1  2  1  1  2  1  1  15 1  17 1  1  1  2  1  3  2  1  2  2  7  1  1  1  1  4  1  1  1  1  1


Notice that in many cases, the next letter is guessed immediately, in one 
guess. In other cases, particularly at the start of syllables, more guesses are 
needed. 

What do this game and these results offer us? First, they demonstrate the 
redundancy of English from the point of view of an English speaker. Second, 
this game might be used in a data compression scheme, as follows. 
The string of numbers ‘1, 1, 1, 5, 1, ...’, listed above, was obtained by
presenting the text to the subject. The maximum number of guesses that the 
subject will make for a given letter is twenty-seven, so what the subject is 
doing for us is performing a time-varying mapping of the twenty-seven letters 
{A,B,C,...,Z,—} onto the twenty-seven numbers {1,2,3,...,27}, which we 
can view as symbols in a new alphabet. The total number of symbols has not 
been reduced, but since he uses some of these symbols much more frequently 
than others — for example, 1 and 2 — it should be easy to compress this new 
string of symbols. 

How would the uncompression of the sequence of numbers ‘1, 1, 1, 5, 1,...’ 
work? At uncompression time, we do not have the original string ‘THERE. ..’, 
we have only the encoded sequence. Imagine that our subject has an absolutely 
identical twin who also plays the guessing game with us, as if we knew the 
source text. If we stop him whenever he has made a number of guesses equal to 
the given number, then he will have just guessed the correct letter, and we can 
then say ‘yes, that’s right’, and move to the next character. Alternatively, if 
the identical twin is not available, we could design a compression system with 
the help of just one human as follows. We choose a window length L, that is, 
a number of characters of context to show the human. For every one of the 
$27^L$ possible strings of length L, we ask them, ‘What would you predict is the
next character?’, and ‘If that prediction were wrong, what would your next
guesses be?’. After tabulating their answers to these $26 \times 27^L$ questions, we
could use two copies of these enormous tables at the encoder and the decoder 
in place of the two human twins. Such a language model is called an Lth order 
Markov model. 
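
The table-driven version of the guessing game can be sketched as follows. Here the human's answers are replaced by a hypothetical stand-in: guesses are ranked by how often each character followed the same L-character context in some training text, with alphabetical order breaking ties. The training text, the choice L = 2 and all function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ-"   # 26 letters plus '-' for space

def build_guess_table(training_text, L):
    """Count which characters follow each context of (up to) L characters."""
    counts = defaultdict(Counter)
    for n in range(len(training_text)):
        counts[training_text[max(0, n - L):n]][training_text[n]] += 1
    return counts

def guesses(table, context):
    """Deterministic ordered list of guesses for the next character."""
    c = table.get(context, Counter())
    return sorted(ALPHABET, key=lambda s: (-c[s], s))

def encode(text, table, L):
    """Replace each character by the number of guesses needed to reach it."""
    return [guesses(table, text[max(0, n - L):n]).index(text[n]) + 1
            for n in range(len(text))]

def decode(numbers, table, L):
    """The 'identical twin': replay the same guesses and stop at the given count."""
    text = ""
    for k in numbers:
        text += guesses(table, text[max(0, len(text) - L):])[k - 1]
    return text

text = "THERE-IS-NO-REVERSE-ON-A-MOTORCYCLE-"
table = build_guess_table(text, 2)   # trained on the message itself, for illustration
numbers = encode(text, table, 2)
print(numbers)
print(decode(numbers, table, 2))
```

Because encoder and decoder consult the same table with the same context, the decoder reconstructs the text exactly; compression comes from the fact that small numbers dominate the encoded sequence.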

These systems are clearly unrealistic for practical compression, but they 
illustrate several principles that we will make use of now. 


> 6.2 Arithmetic codes 


When we discussed variable-length symbol codes, and the optimal Huffman 
algorithm for constructing them, we concluded by pointing out two practical 
and theoretical problems with Huffman codes (section 5.6). 

These defects are rectified by arithmetic codes, which were invented by 
Elias, by Rissanen and by Pasco, and subsequently made practical by Witten 
et al. (1987). In an arithmetic code, the probabilistic modelling is clearly 
separated from the encoding operation. The system is rather similar to the 
guessing game. The human predictor is replaced by a probabilistic model of 
the source. As each symbol is produced by the source, the probabilistic model 
supplies a predictive distribution over all possible values of the next symbol, 
that is, a list of positive numbers {p;} that sum to one. If we choose to model 
the source as producing i.i.d. symbols with some known distribution, then the 
predictive distribution is the same every time; but arithmetic coding can with 
equal ease handle complex adaptive models that produce context-dependent 
predictive distributions. The predictive model is usually implemented in a 
computer program. 

The encoder makes use of the model’s predictions to create a binary string. 
The decoder makes use of an identical twin of the model (just as in the guessing 
game) to interpret the binary string. 

Let the source alphabet be $A_X = \{a_1, \ldots, a_I\}$, and let the Ith symbol $a_I$
have the special meaning ‘end of transmission’. The source spits out a sequence
$x_1, x_2, \ldots, x_n, \ldots$. The source does not necessarily produce i.i.d. symbols. We
will assume that a computer program is provided to the encoder that assigns a
predictive probability distribution over $a_i$ given the sequence that has occurred
thus far, $P(x_n = a_i \mid x_1, \ldots, x_{n-1})$. The receiver has an identical program that
produces the same predictive probability distribution $P(x_n = a_i \mid x_1, \ldots, x_{n-1})$.







Concepts for understanding arithmetic coding 


Notation for intervals. The interval [0.01, 0.10) is all numbers between 0.01 and 
0.10, including 0.010 = 0.01000... but not 0.100 = 0.10000.... 


A binary transmission defines an interval within the real line from 0 to 1. 
For example, the string 01 is interpreted as a binary real number 0.01..., which 
corresponds to the interval [0.01,0.10) in binary, i.e., the interval [0.25, 0.50) 
in base ten. 

The longer string 01101 corresponds to a smaller interval [0.01101, 
0.01110). Because 01101 has the first string, 01, as a prefix, the new in- 
terval is a sub-interval of the interval [0.01,0.10). A one-megabyte binary file 
($2^{23}$ bits) is thus viewed as specifying a number between 0 and 1 to a precision
of about two million decimal places — two million decimal digits, because each 
byte translates into a little more than two decimal digits. 
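
The correspondence between binary strings and intervals can be made concrete with a short sketch; the function name is an illustrative choice.

```python
from fractions import Fraction

def binary_interval(bits):
    """Interval [m/2^k, (m+1)/2^k) of real numbers whose binary expansion starts with `bits`."""
    k = len(bits)
    m = int(bits, 2) if bits else 0
    return Fraction(m, 2 ** k), Fraction(m + 1, 2 ** k)

lo, hi = binary_interval("01")          # [0.25, 0.50)
lo2, hi2 = binary_interval("01101")     # a sub-interval, since 01 is a prefix
print(float(lo), float(hi), float(lo2), float(hi2))
```

Each extra bit halves the width of the interval, and extending a string always yields a sub-interval of the interval of its prefix.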


Now, we can also divide the real line [0,1) into I intervals of lengths equal
to the probabilities $P(x_1 = a_i)$, as shown in figure 6.2.




We may then take each interval $a_i$ and subdivide it into intervals de-
noted $a_i a_1, a_i a_2, \ldots, a_i a_I$, such that the length of $a_i a_j$ is proportional to
$P(x_2 = a_j \mid x_1 = a_i)$. Indeed the length of the interval $a_i a_j$ will be precisely
the joint probability

    $P(x_1 = a_i, x_2 = a_j) = P(x_1 = a_i) P(x_2 = a_j \mid x_1 = a_i)$.    (6.1)

Iterating this procedure, the interval [0,1) can be divided into a sequence
of intervals corresponding to all possible finite length strings $x_1 x_2 \ldots x_N$, such
that the length of an interval is equal to the probability of the string given
our model.
Figure 6.1. Binary strings define
real intervals within the real line 
[0,1). We first encountered a 
picture like this when we 
discussed the symbol-code 
supermarket in Chapter 5. 


Figure 6.2. A probabilistic model 
defines real intervals within the 
real line [0,1). 
Algorithm 6.3. Arithmetic coding. Iterative procedure to find the
interval [u, v) for the string $x_1 x_2 \ldots x_N$.

    u := 0.0
    v := 1.0
    p := v − u
    for n = 1 to N {
        Compute the cumulative probabilities Q_n and R_n (6.2, 6.3)
        v := u + p R_n(x_n | x_1, ..., x_{n−1})
        u := u + p Q_n(x_n | x_1, ..., x_{n−1})
        p := v − u
    }











Formulae describing arithmetic coding 


The process depicted in figure 6.2 can be written explicitly as follows. The 
intervals are defined in terms of the lower and upper cumulative probabilities 


    $Q_n(a_i \mid x_1, \ldots, x_{n-1}) \equiv \sum_{i'=1}^{i-1} P(x_n = a_{i'} \mid x_1, \ldots, x_{n-1})$,    (6.2)

    $R_n(a_i \mid x_1, \ldots, x_{n-1}) \equiv \sum_{i'=1}^{i} P(x_n = a_{i'} \mid x_1, \ldots, x_{n-1})$.    (6.3)


As the nth symbol arrives, we subdivide the (n−1)th interval at the points defined
by $Q_n$ and $R_n$. For example, starting with the first symbol, the intervals ‘$a_1$’,
‘$a_2$’, and ‘$a_I$’ are


    $a_1 \leftrightarrow [Q_1(a_1), R_1(a_1)) = [0, P(x_1 = a_1))$,    (6.4)

    $a_2 \leftrightarrow [Q_1(a_2), R_1(a_2)) = [P(x_1 = a_1), P(x_1 = a_1) + P(x_1 = a_2))$,    (6.5)

and

    $a_I \leftrightarrow [Q_1(a_I), R_1(a_I)) = [P(x_1 = a_1) + \ldots + P(x_1 = a_{I-1}), 1.0)$.    (6.6)


Algorithm 6.3 describes the general procedure. 


To encode a string $x_1 x_2 \ldots x_N$, we locate the interval corresponding to
$x_1 x_2 \ldots x_N$, and send a binary string whose interval lies within that interval.
This encoding can be performed on the fly, as we now illustrate. 
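
The interval-finding procedure of Algorithm 6.3 can be sketched in Python with exact rational arithmetic. The function `find_interval`, the `predict` callback and the i.i.d. example distribution are illustrative assumptions; any program that returns a predictive distribution for a given context could be supplied in place of `iid`.

```python
from fractions import Fraction as F

def find_interval(string, alphabet, predict):
    """Return the interval [u, v) for `string`, following Algorithm 6.3.

    predict(context) must return the predictive probabilities of the next
    symbol, listed in the same order as `alphabet`.
    """
    u, v = F(0), F(1)
    p = v - u
    for n, x in enumerate(string):
        probs = predict(string[:n])
        i = alphabet.index(x)
        Q = sum(probs[:i], F(0))   # lower cumulative probability (6.2)
        R = Q + probs[i]           # upper cumulative probability (6.3)
        v = u + p * R              # computed first, using the old u
        u = u + p * Q
        p = v - u
    return u, v

alphabet = ["a", "b", "$"]         # '$' stands in for the end-of-transmission symbol

def iid(context):
    """An i.i.d. model: the same predictive distribution in every context."""
    return [F(1, 2), F(1, 4), F(1, 4)]

u, v = find_interval("ab$", alphabet, iid)
print(u, v, v - u)
```

The width of the final interval, here 1/32, equals the probability that the model assigns to the whole string, 1/2 × 1/4 × 1/4.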


Example: compressing the tosses of a bent coin 


Imagine that we watch as a bent coin is tossed some number of times (cf. 
example 2.7 (p.30) and section 3.2 (p.51)). The two outcomes when the coin 
is tossed are denoted a and b. A third possibility is that the experiment is 
halted, an event denoted by the ‘end of file’ symbol, ‘□’. Because the coin is
bent, we expect that the probabilities of the outcomes a and b are not equal, 
though beforehand we don’t know which is the more probable outcome. 














Encoding 











Let the source string be ‘bbba□’. We pass along the string one symbol at a
time and use our model to compute the probability distribution of the next
symbol given the string thus far. Let these probabilities be:


    Context
    (sequence thus far)   Probability of next symbol

    (empty)               P(a) = 0.425          P(b) = 0.425          P(□) = 0.15
    b                     P(a | b) = 0.28       P(b | b) = 0.57       P(□ | b) = 0.15
    bb                    P(a | bb) = 0.21      P(b | bb) = 0.64      P(□ | bb) = 0.15
    bbb                   P(a | bbb) = 0.17     P(b | bbb) = 0.68     P(□ | bbb) = 0.15
    bbba                  P(a | bbba) = 0.28    P(b | bbba) = 0.57    P(□ | bbba) = 0.15

















Figure 6.4 shows the corresponding intervals. The interval b is the middle 
0.425 of [0,1). The interval bb is the middle 0.567 of b, and so forth. 


Figure 6.4. Illustration of the arithmetic coding process as the sequence
bbba□ is transmitted.






When the first symbol ‘b’ is observed, the encoder knows that the encoded
string will start ‘01’, ‘10’, or ‘11’, but does not know which. The encoder
writes nothing for the time being, and examines the next symbol, which is ‘b’.
The interval ‘bb’ lies wholly within interval ‘1’, so the encoder can write the
first bit: ‘1’. The third symbol ‘b’ narrows down the interval a little, but not
quite enough for it to lie wholly within interval ‘10’. Only when the next ‘a’
is read from the source can we transmit some more bits. Interval ‘bbba’ lies
wholly within the interval ‘1001’, so the encoder adds ‘001’ to the ‘1’ it has
written. Finally when the ‘□’ arrives, we need a procedure for terminating the
encoding. Magnifying the interval ‘bbba□’ (figure 6.4, right) we note that the
marked interval ‘100111101’ is wholly contained by bbba□, so the encoding
can be completed by appending ‘11101’.
> Exercise 6.1.[2, p.127] Show that the overhead required to terminate a message
is never more than 2 bits, relative to the ideal message length given the
probabilistic model H, h(x | H) = log[1/P(x | H)].


This is an important result. Arithmetic coding is very nearly optimal. The 
message length is always within two bits of the Shannon information content 
of the entire source string, so the expected message length is within two bits 
of the entropy of the entire message. 


Decoding 


The decoder receives the string ‘100111101’ and passes along it one symbol
at a time. First, the probabilities P(a), P(b), P(□) are computed using the
identical program that the encoder used and the intervals ‘a’, ‘b’ and ‘□’ are
deduced. Once the first two bits ‘10’ have been examined, it is certain that
the original string must have started with a ‘b’, since the interval ‘10’ lies
wholly within interval ‘b’. The decoder can then use the model to compute
P(a | b), P(b | b), P(□ | b) and deduce the boundaries of the intervals ‘ba’, ‘bb’
and ‘b□’. Continuing, we decode the second b once we reach ‘1001’, the third
b once we reach ‘100111’, and so forth, with the unambiguous identification
of ‘bbba□’ once the whole binary string has been read. With the convention
that ‘□’ denotes the end of the message, the decoder knows to stop decoding.
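The decoding steps just described can be sketched as follows. This is an illustrative exact-arithmetic decoder, not a practical one; it assumes the same model as figure 6.4 (0.15 for the end-of-file symbol, the rest shared by Laplace's rule) and the interval order a, b, □. The function names are hypothetical. `decode` returns only the symbols that the bits received so far determine unambiguously, so it can be called on prefixes of the transmission.

```python
from fractions import Fraction

EOF = "\u25a1"  # the end-of-file symbol, printed as a box in the text


def predict(context):
    """Assumed model: P(EOF) = 0.15, the rest shared by Laplace's rule."""
    f_a, f_b = context.count("a"), context.count("b")
    rest = Fraction(85, 100)
    return [
        ("a", rest * Fraction(f_a + 1, f_a + f_b + 2)),
        ("b", rest * Fraction(f_b + 1, f_a + f_b + 2)),
        (EOF, Fraction(15, 100)),
    ]


def decode(bits):
    """Return the source symbols that `bits` determines unambiguously."""
    n = len(bits)
    # The bits received so far confine the encoded number to [b_low, b_high).
    b_low = Fraction(int(bits, 2) if bits else 0, 2**n)
    b_high = b_low + Fraction(1, 2**n)
    low, width, out = Fraction(0), Fraction(1), ""
    while not out.endswith(EOF):
        cum = low
        for symbol, p in predict(out):
            # A symbol is decoded once its interval wholly contains the bit interval.
            if cum <= b_low and b_high <= cum + width * p:
                low, width, out = cum, width * p, out + symbol
                break
            cum += width * p
        else:
            break  # no symbol interval contains the bit interval: more bits needed
    return out


for prefix in ("10", "1001", "100111", "100111101"):
    print(prefix, "->", decode(prefix))
```

The loop at the end prints what is known after two, four, six and nine bits, which can be compared with the sequence of identifications described in the text.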















































Transmission of multiple files 


How might one use arithmetic coding to communicate several distinct files over
the binary channel? Once the □ character has been transmitted, we imagine
that the decoder is reset into its initial state. There is no transfer of the learnt
statistics of the first file to the second file. If, however, we did believe that
there is a relationship among the files that we are going to compress, we could
define our alphabet differently, introducing a second end-of-file character that
marks the end of the file but instructs the encoder and decoder to continue
using the same probabilistic model.





The big picture 


Notice that to communicate a string of N letters both the encoder and the 
decoder needed to compute only N|A| conditional probabilities — the proba- 
bilities of each possible letter in each context actually encountered — just as in 
the guessing game. This cost can be contrasted with the alternative of using 
a Huffman code with a large block size (in order to reduce the possible one- 
bit-per-symbol overhead discussed in section 5.6), where all block sequences 
that could occur must be considered and their probabilities evaluated. 
Notice how flexible arithmetic coding is: it can be used with any source 
alphabet and any encoded alphabet. The size of the source alphabet and the 
encoded alphabet can change with time. Arithmetic coding can be used with 
any probability distribution, which can change utterly from context to context. 
Furthermore, if we would like the symbols of the encoding alphabet (say,
0 and 1) to be used with unequal frequency, that can easily be arranged by
subdividing the right-hand interval in proportion to the required frequencies.


How the probabilistic model might make its predictions 


The technique of arithmetic coding does not force one to produce the predictive
probability in any particular way, but the predictive distributions might
naturally be produced by a Bayesian model.

[Figure 6.5. Illustration of the intervals defined by a simple Bayesian
probabilistic model. The size of an interval is proportional to the probability
of the string.]

Figure 6.4 was generated using a simple model that always assigns a probability
of 0.15 to □, and assigns the remaining 0.85 to a and b, divided in
proportion to probabilities given by Laplace’s rule,

$$P_L(a \mid x_1, \ldots, x_{n-1}) = \frac{F_a + 1}{F_a + F_b + 2}, \tag{6.7}$$

where $F_a(x_1, \ldots, x_{n-1})$ is the number of times that a has occurred so far, and
$F_b$ is the count of bs. These predictions correspond to a simple Bayesian model
that expects and adapts to a non-equal frequency of use of the source symbols
a and b within a file.

Figure 6.5 displays the intervals corresponding to a number of strings of 
length up to five. Note that if the string so far has contained a large number of 
bs then the probability of b relative to a is increased, and conversely if many 
as occur then as are made more probable. Larger intervals, remember, require 
fewer bits to encode. 


Details of the Bayesian model 


Having emphasized that any model could be used (arithmetic coding is not
wedded to any particular set of probabilities), let me explain the simple adaptive
probabilistic model used in the preceding example; we first encountered this
model in exercise 2.8 (p.30).


Assumptions 


The model will be described using parameters $p_\square$, $p_a$ and $p_b$, defined below,
which should not be confused with the predictive probabilities in a particular
context, for example, $P(a \mid s = \text{baa})$. A bent coin labelled a and b is tossed some
number of times $l$, which we don’t know beforehand. The coin’s probability of
coming up a when tossed is $p_a$, and $p_b = 1 - p_a$; the parameters $p_a$, $p_b$ are not
known beforehand. The source string s = baaba□ indicates that $l$ was 5 and
the sequence of outcomes was baaba.














1. It is assumed that the length of the string $l$ has an exponential probability
distribution

$$P(l) = (1 - p_\square)^l \, p_\square. \tag{6.8}$$

This distribution corresponds to assuming a constant probability $p_\square$ for
the termination symbol ‘□’ at each character.

2. It is assumed that the non-terminal characters in the string are selected
independently at random from an ensemble with probabilities $P = \{p_a, p_b\}$;
the probability $p_a$ is fixed throughout the string to some unknown value
that could be anywhere between 0 and 1. The probability of an a occurring
as the next symbol, given $p_a$ (if only we knew it), is $(1 - p_\square) p_a$. The
probability, given $p_a$, that an unterminated string of length F is a given
string s that contains $\{F_a, F_b\}$ counts of the two outcomes is the Bernoulli
distribution

$$P(s \mid p_a, F) = p_a^{F_a} (1 - p_a)^{F_b}. \tag{6.9}$$

3. We assume a uniform prior distribution for $p_a$,

$$P(p_a) = 1, \quad p_a \in [0, 1], \tag{6.10}$$

and define $p_b = 1 - p_a$. It would be easy to assume other priors on $p_a$,
with beta distributions being the most convenient to handle.


This model was studied in section 3.2. The key result we require is the predictive
distribution for the next symbol, given the string so far, s. This probability
that the next character is a or b (assuming that it is not ‘□’) was derived in
equation (3.16) and is precisely Laplace’s rule (6.7).
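One way to check numerically that Laplace’s rule is the predictive distribution of this model is to compare two ways of computing the probability of a string of as and bs (ignoring termination). Multiplying the Laplace predictions symbol by symbol should give the same number as averaging the Bernoulli likelihood (6.9) over the uniform prior (6.10), for which the standard beta integral gives $F_a! \, F_b! / (F_a + F_b + 1)!$. The sketch below is only a consistency check with hypothetical function names, not a derivation.

```python
from fractions import Fraction
from math import factorial


def laplace(symbol, context):
    """Laplace's rule (6.7) for the next symbol, given the string so far."""
    f_a, f_b = context.count("a"), context.count("b")
    count = f_a if symbol == "a" else f_b
    return Fraction(count + 1, f_a + f_b + 2)


def sequential_probability(s):
    """Product of the predictive probabilities of each symbol in turn."""
    p = Fraction(1)
    for i, symbol in enumerate(s):
        p *= laplace(symbol, s[:i])
    return p


def marginal_probability(s):
    """Integral of p_a^F_a (1 - p_a)^F_b over the uniform prior on p_a."""
    f_a, f_b = s.count("a"), s.count("b")
    return Fraction(factorial(f_a) * factorial(f_b), factorial(f_a + f_b + 1))


for s in ("a", "bbba", "baaba"):
    print(s, sequential_probability(s), marginal_probability(s))
```

The two columns agree for every string, and the result depends only on the counts, not on the order of the symbols, as expected for a model with independent tosses of a coin of unknown bias.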














> Exercise 6.2.[3] Compare the expected message length when an ASCII file is
compressed by the following three methods.

Huffman-with-header. Read the whole file, find the empirical frequency
of each symbol, construct a Huffman code for those frequencies,
transmit the code by transmitting the lengths of the Huffman
codewords, then transmit the file using the Huffman code. (The
actual codewords don’t need to be transmitted, since we can use a
deterministic method for building the tree given the codelengths.)

Arithmetic code using the Laplace model.

$$P_L(a \mid x_1, \ldots, x_{n-1}) = \frac{F_a + 1}{\sum_{a'} (F_{a'} + 1)}. \tag{6.11}$$

Arithmetic code using a Dirichlet model. This model’s predictions are:

$$P_D(a \mid x_1, \ldots, x_{n-1}) = \frac{F_a + \alpha}{\sum_{a'} (F_{a'} + \alpha)}, \tag{6.12}$$

where $\alpha$ is fixed to a number such as 0.01. A small value of $\alpha$
corresponds to a more responsive version of the Laplace model;
the probability over characters is expected to be more nonuniform;
$\alpha = 1$ reproduces the Laplace model.

Take care that the header of your Huffman message is self-delimiting.
Special cases worth considering are (a) short files with just a few hundred
characters; (b) large files in which some characters are never used.
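The Laplace and Dirichlet predictions (6.11) and (6.12) differ only in the constant added to each count, so one function can evaluate both. The sketch below computes the ideal message length, the sum of log2(1/P) over the symbols of a string, for a chosen value of α; it ignores end-of-file handling and the small termination overhead of a real arithmetic coder. The function name and the test strings are illustrative assumptions, and this is a starting point for experiments rather than a solution to the exercise.

```python
import math
from collections import Counter


def ideal_length(text, alphabet, alpha):
    """Bits needed under the adaptive model (F_a + alpha) / sum_a' (F_a' + alpha)."""
    counts = Counter()
    bits = 0.0
    for n, symbol in enumerate(text):
        # After n symbols the denominator sum_a' (F_a' + alpha) equals n + alpha * |A|.
        p = (counts[symbol] + alpha) / (n + alpha * len(alphabet))
        bits += math.log2(1 / p)
        counts[symbol] += 1
    return bits


ascii_alphabet = [chr(i) for i in range(128)]
skewed = "a" * 100  # an extremely nonuniform file, chosen only for illustration
print("alpha = 1.00:", round(ideal_length(skewed, ascii_alphabet, 1.0), 1), "bits")
print("alpha = 0.01:", round(ideal_length(skewed, ascii_alphabet, 0.01), 1), "bits")
```

With α = 1 and a two-letter alphabet the function reduces to Laplace’s rule (6.7); on the skewed string the small-α model adapts much faster, which is the sense in which it is "more responsive".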


> 6.3 Further applications of arithmetic coding 


Efficient generation of random samples 


Arithmetic coding not only offers a way to compress strings believed to come
from a given model; it also offers a way to generate random strings from a
model. Imagine sticking a pin into the unit interval at random, that line
having been divided into subintervals in proportion to probabilities $p_i$; the
probability that your pin will lie in interval i is $p_i$.

So to generate a sample from a model, all we need to do is feed ordinary 
random bits into an arithmetic decoder for that model. An infinite random 
bit sequence corresponds to the selection of a point at random from the line 
(0,1), so the decoder will then select a string at random from the assumed 
distribution. This arithmetic method is guaranteed to use very nearly the 
smallest number of random bits possible to make the selection — an important 
point in communities where random numbers are expensive! [This is not a joke. 
Large amounts of money are spent on generating random bits in software and 
hardware. Random numbers are valuable.] 
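The idea of feeding random bits into an arithmetic decoder can be sketched for the simplest case, a source of independent bits with a fixed probability of being 1. The code below is an illustrative assumption-laden sketch (exact fractions, a hypothetical function name `sample_bits`, Python's `random` module standing in for a supply of fair bits): it keeps track of what the fair bits read so far say about the pin's position, and reads a new fair bit only when that knowledge straddles the boundary between the ‘0’ and ‘1’ intervals.

```python
import math
import random
from fractions import Fraction


def sample_bits(p_one, n_samples, rng):
    """Draw n_samples bits with P(1) = p_one by arithmetic-decoding fair bits.

    Returns (samples, number of fair random bits consumed).
    """
    p_zero = 1 - Fraction(p_one)
    # [u_low, u_low + u_width) is the pin's known position, measured relative
    # to the interval of the source string generated so far.
    u_low, u_width = Fraction(0), Fraction(1)
    samples, used = [], 0
    while len(samples) < n_samples:
        if u_low + u_width <= p_zero:      # wholly inside the '0' interval
            samples.append(0)
            u_low, u_width = u_low / p_zero, u_width / p_zero
        elif u_low >= p_zero:              # wholly inside the '1' interval
            samples.append(1)
            u_low = (u_low - p_zero) / (1 - p_zero)
            u_width = u_width / (1 - p_zero)
        else:                              # straddles the boundary: read a fair bit
            u_width /= 2
            u_low += u_width * rng.getrandbits(1)
            used += 1
    return samples, used


samples, used = sample_bits(Fraction(1, 100), 1000, random.Random(0))
info = sum(math.log2(100) if s else math.log2(100 / 99) for s in samples)
print(f"{sum(samples)} ones, {used} fair bits used, information content {info:.1f} bits")
```

The number of fair bits consumed can never be less than the information content of the generated string, and in this sketch it is typically only a few bits more.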

A simple example of the use of this technique is in the generation of random
bits with a nonuniform distribution $\{p_0, p_1\}$.

> Exercise 6.3.[2, p.128] Compare the following two techniques for generating
random symbols from a nonuniform distribution $\{p_0, p_1\} = \{0.99, 0.01\}$:

(a) The standard method: use a standard random number generator
to generate an integer between 1 and $2^{32}$. Rescale the integer to
(0,1). Test whether this uniformly distributed random variable is
less than 0.99, and emit a 0 or 1 accordingly.

(b) Arithmetic coding using the correct model, fed with standard random
bits.

Roughly how many random bits will each method use to generate a
thousand samples from this sparse distribution?


Efficient data-entry devices 


When we enter text into a computer, we make gestures of some sort: maybe
we tap a keyboard, or scribble with a pointer, or click with a mouse; an
efficient text entry system is one where the number of gestures required to
enter a given text string is small.

[Margin note. Compression: text → bits. Writing: text ← gestures.]

Writing can be viewed as an inverse process to data compression. In data
compression, the aim is to map a given text string into a small number of bits.
In text entry, we want a small sequence of gestures to produce our intended
text.

By inverting an arithmetic coder, we can obtain an information-efficient
text entry device that is driven by continuous pointing gestures (Ward et al.,
2000). In this system, called Dasher, the user zooms in on the unit interval to
locate the interval corresponding to their intended string, in the same style as
figure 6.4. A language model (exactly as used in text compression) controls
the sizes of the intervals such that probable strings are quick and easy to
identify. After an hour’s practice, a novice user can write with one finger
driving Dasher at about 25 words per minute; that’s about half their normal
ten-finger typing speed on a regular keyboard. It’s even possible to write at 25
words per minute, hands-free, using gaze direction to drive Dasher (Ward and
MacKay, 2002). Dasher is available as free software for various platforms.¹


> 6.4 Lempel–Ziv coding


The Lempel–Ziv algorithms, which are widely used for data compression (e.g.,
the compress and gzip commands), are different in philosophy to arithmetic
coding. There is no separation between modelling and coding, and no
opportunity for explicit modelling.


Basic Lempel–Ziv algorithm


The method of compression is to replace a substring with a pointer to
an earlier occurrence of the same substring. For example if the string is
1011010100010..., we parse it into an ordered dictionary of substrings that
have not appeared before as follows: λ, 1, 0, 11, 01, 010, 00, 10, .... We
include the empty substring λ as the first substring in the dictionary and order
the substrings in the dictionary by the order in which they emerged from the
source. After every comma, we look along the next part of the input sequence
until we have read a substring that has not been marked off before. A
moment’s reflection will confirm that this substring is longer by one bit than a
substring that has occurred earlier in the dictionary. This means that we can
encode each substring by giving a pointer to the earlier occurrence of that
prefix and then sending the extra bit by which the new substring in the dictionary
differs from the earlier substring. If, at the nth bit, we have enumerated s(n)
substrings, then we can give the value of the pointer in $\lceil \log_2 s(n) \rceil$ bits. The
code for the above sequence is then as shown in the fourth line of the following
table (with punctuation included for clarity), the upper lines indicating the
source string and the value of s(n):


source substrings   λ     1      0      11      01      010      00       10
s(n)                0     1      2      3       4       5        6        7
s(n) binary         000   001    010    011     100     101      110      111
(pointer, bit)            (,1)   (0,0)  (01,1)  (10,1)  (100,0)  (010,0)  (001,0)


Notice that the first pointer we send is empty, because, given that there is
only one substring in the dictionary (the string λ), no bits are needed to
convey the ‘choice’ of that substring as the prefix. The encoded string is
100011101100001000010. The encoding, in this simple case, is actually a
longer string than the source string, because there was no obvious redundancy
in the source string.
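The parsing and pointer rules described above can be sketched in a few lines. This is an illustrative implementation of the basic algorithm only: the function name `lz_encode` is hypothetical, and because the text does not specify how a message is terminated, the sketch simply assumes that the source ends exactly at the end of a new substring and raises an error otherwise.

```python
def lz_encode(source):
    """Basic Lempel-Ziv: emit a (pointer, bit) pair for each new substring."""
    dictionary = {"": 0}  # substring -> index; starts with the empty substring
    out, phrase = [], ""
    for bit in source:
        if phrase + bit in dictionary:
            phrase += bit  # keep reading until the substring is new
            continue
        # With s(n) substrings enumerated, the pointer takes ceil(log2 s(n)) bits.
        pointer_bits = (len(dictionary) - 1).bit_length()
        if pointer_bits:
            pointer = format(dictionary[phrase], "b").zfill(pointer_bits)
        else:
            pointer = ""  # only one substring to choose from: nothing to send
        out.append(pointer + bit)
        dictionary[phrase + bit] = len(dictionary)
        phrase = ""
    if phrase:
        raise ValueError("source ends inside a known substring; "
                         "termination is not handled in this sketch")
    return "".join(out)


print(lz_encode("1011010100010"))
```

Applied to the example string, the sketch reproduces the encoded string given in the text, one (pointer, bit) pair per column of the table.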


> Exercise 6.4.[2] Prove that any uniquely decodeable code from {0,1}* to
{0,1}* necessarily makes some strings longer if it makes some strings
shorter.


¹ http://www.inference.phy.cam.ac.uk/dasher/


One reason why the algorithm described above lengthens a lot of strings is
that it is inefficient: it transmits unnecessary bits; to put it another way,
its code is not complete. Once a substring in the dictionary has been joined
there by both of its children, then we can be sure that it will not be needed
(except possibly as part of our protocol for terminating a message); so at that
point we could drop it from our dictionary of substrings and shuffle them
all along one, thereby reducing the length of subsequent pointer messages.
Equivalently, we could write the second prefix into the dictionary at the point
previously occupied by the parent. A second unnecessary overhead is the
transmission of the new bit in these cases: the second time a prefix is used,
we can be sure of the identity of the next bit.


Decoding 


The decoder again involves an identical twin at the decoding end who con- 
structs the dictionary of substrings as the data are decoded. 
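The ‘identical twin’ can be sketched as follows: the decoder rebuilds the same dictionary as it reads, so at every step it knows how many bits the next pointer occupies. The function name `lz_decode` is hypothetical, and the sketch covers only the basic algorithm without any termination protocol.

```python
def lz_decode(encoded):
    """Invert the basic Lempel-Ziv code by rebuilding the dictionary."""
    dictionary = [""]  # index -> substring; starts with the empty substring
    out, pos = [], 0
    while pos < len(encoded):
        # Same rule as the encoder: ceil(log2 s(n)) bits for the pointer.
        pointer_bits = (len(dictionary) - 1).bit_length()
        if pos + pointer_bits + 1 > len(encoded):
            raise ValueError("encoded string ends inside a (pointer, bit) pair")
        pointer = int(encoded[pos:pos + pointer_bits], 2) if pointer_bits else 0
        if pointer >= len(dictionary):
            raise ValueError("pointer refers to a substring not yet in the dictionary")
        bit = encoded[pos + pointer_bits]
        substring = dictionary[pointer] + bit
        dictionary.append(substring)
        out.append(substring)
        pos += pointer_bits + 1
    return "".join(out)


print(lz_decode("100011101100001000010"))
```

The two `ValueError` branches correspond to binary strings that are not the encoding of any source string, one symptom of the code not being complete.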


> Exercise 6.5.[2, p.128] Encode the string 000000000000100000000000 using
the basic Lempel–Ziv algorithm described above.


> Exercise 6.6.[2, p.128] Decode the string

    00101011101100100100011010101000011

that was encoded using the basic Lempel–Ziv algorithm.


Practicalities 


In this description I have not discussed the method for terminating a string. 

There are many variations on the Lempel-Ziv algorithm, all exploiting the 
same idea but using different procedures for dictionary management, etc. The 
resulting programs are fast, but their performance on compression of English 
text, although useful, does not match the standards set in the arithmetic 
coding literature. 


Theoretical properties 


In contrast to the block code, Huffman code, and arithmetic coding methods
we discussed in the last three chapters, the Lempel–Ziv algorithm is defined
without making any mention of a probabilistic model for the source. Yet, given
any ergodic source (i.e., one that is memoryless on sufficiently long timescales),
the Lempel–Ziv algorithm can be proven asymptotically to compress down to
the entropy of the source. This is why it is called a ‘universal’ compression
algorithm. For a proof of this property, see Cover and Thomas (1991).

It achieves its compression, however, only by memorizing substrings that 
have happened so that it has a short name for them the next time they occur. 
The asymptotic timescale on which this universal performance is achieved may, 
for many sources, be unfeasibly long, because the number of typical substrings 
that need memorizing may be enormous. The useful performance of the al- 
gorithm in practice is a reflection of the fact that many files contain multiple 
repetitions of particular short sequences of characters, a form of redundancy 
to which the algorithm is well suited.


Common ground


I have emphasized the difference in philosophy behind arithmetic coding and
Lempel–Ziv coding. There is common ground between them, though: in
principle, one can design adaptive probabilistic models, and thence arithmetic
codes, that are ‘universal’, that is, models that will asymptotically compress
any source in some class to within some factor (preferably 1) of its entropy.
However, for practical purposes, I think such universal models can only be
constructed if the class of sources is severely restricted. A general purpose
compressor that can discover the probability distribution of any source would
be a general purpose artificial intelligence! A general purpose artificial
intelligence does not yet exist.


> 6.5 Demonstration 


An interactive aid for exploring arithmetic coding, dasher.tcl, is available.²

A demonstration arithmetic-coding software package written by Radford
Neal³ consists of encoding and decoding modules to which the user adds a
module defining the probabilistic model. It should be emphasized that there
is no single general-purpose arithmetic-coding compressor; a new model has to
be written for each type of source. Radford Neal’s package includes a simple
adaptive model similar to the Bayesian model demonstrated in section 6.2.
The results using this Laplace model should be viewed as a basic benchmark
since it is the simplest possible probabilistic model: it simply assumes the
characters in the file come independently from a fixed ensemble. The counts
$\{F_i\}$ of the symbols $\{a_i\}$ are rescaled and rounded as the file is read such that
all the counts lie between 1 and 256.

A state-of-the-art compressor for documents containing text and images,
DjVu, uses arithmetic coding.⁴ It uses a carefully designed approximate
arithmetic coder for binary alphabets called the Z-coder (Bottou et al., 1998), which
is much faster than the arithmetic coding software described above. One of
the neat tricks the Z-coder uses is this: the adaptive model adapts only
occasionally (to save on computer time), with the decision about when to adapt
being pseudo-randomly controlled by whether the arithmetic encoder emitted
a bit.

The JBIG image compression standard for binary images uses arithmetic 
coding with a context-dependent model, which adapts using a rule similar to 
Laplace’s rule. PPM (Teahan, 1995) is a leading method for text compression, 
and it uses arithmetic coding. 

There are many Lempel-Ziv-based programs. gzip is based on a version 
of Lempel-Ziv called ‘LZ77’ (Ziv and Lempel, 1977). compress is based on 
‘LZW’ (Welch, 1984). In my experience the best is gzip, with compress being 
inferior on most files. 

bzip is a block-sorting file compressor, which makes use of a neat hack
called the Burrows–Wheeler transform (Burrows and Wheeler, 1994). This
method is not based on an explicit probabilistic model, and it only works well
for files larger than several thousand characters; but in practice it is a very
effective compressor for files in which the context of a character is a good
predictor for that character.⁵





² http://www.inference.phy.cam.ac.uk/mackay/itprnn/softwareI.html

³ ftp://ftp.cs.toronto.edu/pub/radford/www/ac.software.html

⁴ http://www.djvuzone.org/

⁵ There is a lot of information about the Burrows–Wheeler transform on the net.
http://dogma.net/DataCompression/BWT.shtml


Compression of a text file


Table 6.6 gives the computer time in seconds taken and the compression
achieved when these programs are applied to the LaTeX file containing the
text of this chapter, of size 20,942 bytes.


Table 6.6. Comparison of compression algorithms applied to a text file.

Method          Compression   Compressed size     Uncompression
                time / sec    (%age of 20,942)    time / sec

Laplace model   0.28          12974 (61%)         0.32
gzip            0.10           8177 (39%)         0.01
compress        0.05          10816 (51%)         0.05
bzip                           7495 (36%)
bzip2                          7640 (36%)
ppmz                           6800 (32%)


Compression of a sparse file 


Interestingly, gzip does not always do so well. Table 6.7 gives the compression
achieved when these programs are applied to a text file containing $10^6$
characters, each of which is either 0 or 1 with probabilities 0.99 and 0.01.
The Laplace model is quite well matched to this source, and the benchmark
arithmetic coder gives good performance, followed closely by compress; gzip
is worst. An ideal model for this source would compress the file into about
$10^6 H_2(0.01)/8 \simeq 10\,100$ bytes. The Laplace-model compressor falls short of
this performance because it is implemented using only eight-bit precision. The
ppmz compressor compresses the best of all, but takes much more computer
time.


Table 6.7. Comparison of compression algorithms applied to a random file of
$10^6$ characters, 99% 0s and 1% 1s.

Method          Compression   Compressed size    Uncompression
                time / sec    / bytes            time / sec

Laplace model   0.45          14143 (1.4%)       0.57
gzip            0.22          20646 (2.1%)       0.04
gzip --best+    1.63          15553 (1.6%)       0.05
compress        0.13          14785 (1.5%)       0.03
bzip            0.30          10903 (1.09%)      0.17
bzip2           0.19          11260 (1.12%)      0.05
ppmz            533           10447 (1.04%)      535
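The ideal compressed size quoted for this sparse file is a one-line calculation, which the following sketch carries out. It assumes only the definition of the binary entropy $H_2$ and the file description given above (a million independent characters with probabilities 0.99 and 0.01); the variable names are illustrative.

```python
import math


def binary_entropy(p):
    """H_2(p) = p log2(1/p) + (1 - p) log2(1/(1 - p)), in bits."""
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


n_chars = 10**6
ideal_bits = n_chars * binary_entropy(0.01)
ideal_bytes = ideal_bits / 8
print(f"H_2(0.01) = {binary_entropy(0.01):.4f} bits per character")
print(f"ideal compressed size = {ideal_bytes:.0f} bytes")
```

Dividing each compressed size in table 6.7 by `ideal_bytes` shows how far each program is from the entropy of the source.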


> 6.6 Summary 


In the last three chapters we have studied three classes of data compression 
codes. 


Fixed-length block codes (Chapter 4). These are mappings from a fixed
number of source symbols to a fixed-length binary message. Only a tiny
fraction of the source strings are given an encoding. These codes were
fun for identifying the entropy as the measure of compressibility but they
are of little practical use.


Symbol codes (Chapter 5). Symbol codes employ a variable-length code for
each symbol in the source alphabet, the codelengths being integer lengths
determined by the probabilities of the symbols. Huffman’s algorithm
constructs an optimal symbol code for a given set of symbol probabilities.


Every source string has a uniquely decodeable encoding, and if the source
symbols come from the assumed distribution then the symbol code will
compress to an expected length per character L lying in the interval
[H, H + 1). Statistical fluctuations in the source may make the actual
length longer or shorter than this mean length.


If the source is not well matched to the assumed distribution then the
mean length is increased by the relative entropy $D_{\mathrm{KL}}$ between the source
distribution and the code’s implicit distribution. For sources with small
entropy, the symbol code has to emit at least one bit per source symbol;
compression below one bit per source symbol can be achieved only by
the cumbersome procedure of putting the source data into blocks.


Stream codes. The distinctive property of stream codes, compared with 
symbol codes, is that they are not constrained to emit at least one bit for 
every symbol read from the source stream. So large numbers of source 
symbols may be coded into a smaller number of bits. This property 
could be obtained using a symbol code only if the source stream were 
somehow chopped into blocks. 


• Arithmetic codes combine a probabilistic model with an encoding
algorithm that identifies each string with a sub-interval of [0,1) of
size equal to the probability of that string under the model. This
code is almost optimal in the sense that the compressed length of a
string x closely matches the Shannon information content of x given
the probabilistic model. Arithmetic codes fit with the philosophy
that good compression requires data modelling, in the form of an
adaptive Bayesian model.


• Lempel–Ziv codes are adaptive in the sense that they memorize
strings that have already occurred. They are built on the philosophy
that we don’t know anything at all about what the probability
distribution of the source will be, and we want a compression
algorithm that will perform reasonably well whatever that distribution
is.

Both arithmetic codes and Lempel–Ziv codes will fail to decode correctly
if any of the bits of the compressed file are altered. So if compressed files are
to be stored or transmitted over noisy media, error-correcting codes will be
essential. Reliable communication over unreliable channels is the topic of Part II.


> 6.7 Exercises on stream codes


> Exercise 6.7.[2] Describe an arithmetic coding algorithm to encode random bit
strings of length N and weight K (i.e., K ones and N − K zeroes) where
N and K are given.


For the case N = 5, K = 2, show in detail the intervals corresponding to
all source substrings of lengths 1–5.


> Exercise 6.8.[2, p.128] How many bits are needed to specify a selection of K
objects from N objects? (N and K are assumed to be known and the
selection of K objects is unordered.) How might such a selection be
made at random without being wasteful of random bits?


> Exercise 6.9.[2] A binary source X emits independent identically distributed
symbols with probability distribution $\{f_0, f_1\}$, where $f_1 = 0.01$. Find
an optimal uniquely-decodeable symbol code for a string $x = x_1 x_2 x_3$ of
three successive samples from this source.


Estimate (to one decimal place) the factor by which the expected length
of this optimal code is greater than the entropy of the three-bit string x.


[$H_2(0.01) \simeq 0.08$, where $H_2(x) = x \log_2(1/x) + (1 - x) \log_2(1/(1 - x))$.]


An arithmetic code is used to compress a string of 1000 samples from
the source X. Estimate the mean and standard deviation of the length
of the compressed file.

> Exercise 6.10.[2] Describe an arithmetic coding algorithm to generate random
bit strings of length N with density f (i.e., each bit has probability f of
being a one) where N is given.


Exercise 6.11.[2] Use a modified Lempel–Ziv algorithm in which, as discussed
on p.120, the dictionary of prefixes is pruned by writing new prefixes
into the space occupied by prefixes that will not be needed again.
Such prefixes can be identified when both their children have been
added to the dictionary of prefixes. (You may neglect the issue of
termination of encoding.) Use this algorithm to encode the string
0100001000100010101000001. Highlight the bits that follow a prefix
on the second occasion that that prefix is used. (As discussed earlier,
these bits could be omitted.)


Exercise 6.12.[2, p.128] Show that this modified Lempel–Ziv code is still not
‘complete’, that is, there are binary strings that are not encodings of
any string.


> Exercise 6.13.[3, p.128] Give examples of simple sources that have low entropy
but would not be compressed well by the Lempel–Ziv algorithm.


> 6.8 Further exercises on data compression 


The following exercises may be skipped by the reader who is eager to learn 
about noisy channels. 


Exercise 6.14.[2, p.130] Consider a Gaussian distribution in N dimensions,

$$P(\mathbf{x}) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{\sum_n x_n^2}{2\sigma^2} \right). \tag{6.13}$$

Define the radius of a point x to be $r = \left( \sum_n x_n^2 \right)^{1/2}$. Estimate the mean
and variance of the square of the radius, $r^2 = \sum_n x_n^2$.


You may find helpful the integral

$$\int \mathrm{d}x \, \frac{1}{(2\pi\sigma^2)^{1/2}} \, x^4 \exp\left( -\frac{x^2}{2\sigma^2} \right) = 3\sigma^4, \tag{6.14}$$

though you should be able to estimate the required quantities without it.


[Figure 6.8. Schematic representation of the typical set of an N-dimensional
Gaussian distribution. Labels: ‘probability density is maximized here’;
‘almost all probability mass is here’.]


Assuming that N is large, show that nearly all the probability of a
Gaussian is contained in a thin shell of radius $\sqrt{N}\sigma$. Find the thickness
of the shell.


Evaluate the probability density (6.13) at a point in that thin shell and
at the origin x = 0 and compare. Use the case N = 1000 as an example.


Notice that nearly all the probability mass is located in a different part
of the space from the region of highest probability density.


> Exercise 6.15.[2] Explain what is meant by an optimal binary symbol code.


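Find an optimal binary symbol code for the ensemble:

$$\mathcal{A} = \{a, b, c, d, e, f, g, h, i, j\},$$

$$\mathcal{P} = \left\{ \tfrac{1}{100}, \tfrac{2}{100}, \tfrac{4}{100}, \tfrac{5}{100}, \tfrac{6}{100}, \tfrac{8}{100}, \tfrac{9}{100}, \tfrac{10}{100}, \tfrac{25}{100}, \tfrac{30}{100} \right\},$$

and compute the expected length of the code.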


> Exercise 6.16.[2] A string $y = x_1 x_2$ consists of two independent samples from
an ensemble

$$X : \mathcal{A}_X = \{a, b, c\}; \quad \mathcal{P}_X = \left\{ \tfrac{1}{10}, \tfrac{3}{10}, \tfrac{6}{10} \right\}.$$

What is the entropy of y? Construct an optimal binary symbol code for
the string y, and find its expected length.


> Exercise 6.17.[2] Strings of N independent samples from an ensemble with
P = {0.1, 0.9} are compressed using an arithmetic code that is matched
to that ensemble. Estimate the mean and standard deviation of the
compressed strings’ lengths for the case N = 1000. [$H_2(0.1) \simeq 0.47$]


Exercise 6.18.[3] Source coding with variable-length symbols.


    In the chapters on source coding, we assumed that we were
    encoding into a binary alphabet {0,1} in which both symbols
    should be used with equal frequency. In this question we
    explore how the encoding alphabet should be used if the symbols
    take different times to transmit.


A poverty-stricken student communicates for free with a friend using a
telephone by selecting an integer n ∈ {1, 2, 3, ...}, making the friend’s
phone ring n times, then hanging up in the middle of the nth ring. This
process is repeated so that a string of symbols $n_1 n_2 n_3 \ldots$ is received.
What is the optimal way to communicate? If large integers n are selected
then the message takes longer to communicate. If only small integers n
are used then the information content per symbol is small. We aim to
maximize the rate of information transfer, per unit time.


Assume that the time taken to transmit a number of rings n and to
redial is $l_n$ seconds. Consider a probability distribution over n, $\{p_n\}$.
Defining the average duration per symbol to be

$$L(\mathbf{p}) = \sum_n p_n l_n \tag{6.15}$$

and the entropy per symbol to be

$$H(\mathbf{p}) = \sum_n p_n \log_2 \frac{1}{p_n}, \tag{6.16}$$

show that for the average information rate per second to be maximized,
the symbols must be used with probabilities of the form

$$p_n = \frac{1}{Z} 2^{-\beta l_n} \tag{6.17}$$

where $Z = \sum_n 2^{-\beta l_n}$ and $\beta$ satisfies the implicit equation

$$\beta = \frac{H(\mathbf{p})}{L(\mathbf{p})}, \tag{6.18}$$

that is, $\beta$ is the rate of communication. Show that these two equations
(6.17, 6.18) imply that $\beta$ must be set such that

$$\log Z = 0. \tag{6.19}$$

Assuming that the channel has the property

$$l_n = n \ \text{seconds}, \tag{6.20}$$

find the optimal distribution p and show that the maximal information
rate is 1 bit per second.


How does this compare with the information rate per second achieved if 
p is set to (1/2, 1/2,0,0,0,0,...) — that is, only the symbols n = 1 and 
n = 2 are selected, and they have equal probability? 


Discuss the relationship between the results (6.17, 6.19) derived above, 
and the Kraft inequality from source coding theory. 


How might a random binary source be efficiently encoded into a sequence
of symbols $n_1 n_2 n_3 \ldots$ for transmission over the channel defined
in equation (6.20)?
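The claims in this exercise can be checked numerically before attempting the derivation. The sketch below is a sanity check only: it truncates the infinite alphabet at a hypothetical cut-off `N_MAX`, plugs β = 1 (the rate the exercise asks you to establish) into the form (6.17) for the channel (6.20), and compares the resulting information rate with that of the two-symbol distribution mentioned in the exercise. All names are illustrative.

```python
import math


def rate(p, durations):
    """Information rate H(p) / L(p), in bits per second."""
    entropy = sum(q * math.log2(1 / q) for q in p if q > 0)
    mean_duration = sum(q * l for q, l in zip(p, durations))
    return entropy / mean_duration


N_MAX = 60  # truncation of the alphabet; the neglected probability is about 2**-60
durations = list(range(1, N_MAX + 1))  # l_n = n seconds, equation (6.20)

beta = 1.0
weights = [2 ** (-beta * l) for l in durations]
Z = sum(weights)
p_opt = [w / Z for w in weights]  # the form (6.17)
p_two = [0.5, 0.5] + [0.0] * (N_MAX - 2)  # only n = 1 and n = 2, equally likely

print(f"log2 Z = {math.log2(Z):.2e}")
print(f"rate with p_n proportional to 2^(-n): {rate(p_opt, durations):.6f} bits/s")
print(f"rate with p = (1/2, 1/2, 0, ...):     {rate(p_two, durations):.6f} bits/s")
```

Trying other values of `beta` in the same script shows that the rate H/L equals β only when log Z vanishes, which is the content of (6.19).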


> Exercise 6.19.[1] How many bits does it take to shuffle a pack of cards?


> Exercise 6.20.[2] In the card game Bridge, the four players receive 13 cards
each from the deck of 52 and start each game by looking at their own
hand and bidding. The legal bids are, in ascending order 1♣, 1♦, 1♥, 1♠,
1NT, 2♣, 2♦, ... 7♥, 7♠, 7NT, and successive bids must follow this
order; a bid of, say, 2♥ may only be followed by higher bids such as 2♠
or 3♣ or 7NT. (Let us neglect the ‘double’ bid.)


The players have several aims when bidding. One of the aims is for two 
partners to communicate to each other as much as possible about what 
cards are in their hands. 


Let us concentrate on this task. 


(a) After the cards have been dealt, how many bits are needed for North 
to convey to South what her hand is? 


(b) Assuming that E and W do not bid at all, what is the maximum 
total information that N and S can convey to each other while 
bidding? Assume that N starts the bidding, and that once either 
N or S stops bidding, the bidding stops. 
> Exercise 6.21.[2] My old ‘arabic’ microwave oven had 11 buttons for entering
cooking times, and my new ‘roman’ microwave has just five. The buttons
of the roman microwave are labelled ‘10 minutes’, ‘1 minute’, ‘10
seconds’, ‘1 second’, and ‘Start’; I’ll abbreviate these five strings to the
symbols M, C, X, I, □. To enter one minute and twenty-three seconds
(1:23), the arabic sequence is

123□, (6.21)

and the roman sequence is

CXXIII□. (6.22)





Each of these keypads defines a code mapping the 3599 cooking times 
from 0:01 to 59:59 into a string of symbols. 


(a) Which times can be produced with two or three symbols? (For
example, 0:20 can be produced by three symbols in either code:
XX□ and 20□.)

















(b) Are the two codes complete? Give a detailed answer. 


(c) For each code, name a cooking time that it can produce in four 
symbols that the other code cannot. 


(d) Discuss the implicit probability distributions over times to which 
each of these codes is best matched. 


(e) Concoct a plausible probability distribution over times that a real 
user might use, and evaluate roughly the expected number of sym- 
bols, and maximum number of symbols, that each code requires. 
Discuss the ways in which each code is inefficient or efficient. 


(f) Invent a more efficient cooking-time-encoding system for a mi- 
crowave oven. 


Exercise 6.22.[2, p.132] Is the standard binary representation for positive
integers (e.g. $c_b(5) = 101$) a uniquely decodeable code?


Design a binary code for the positive integers, i.e., a mapping from
$n \in \{1,2,3,\ldots\}$ to $c(n) \in \{0,1\}^+$, that is uniquely decodeable. Try
to design codes that are prefix codes and that satisfy the Kraft equality
$\sum_n 2^{-l_n} = 1$.


Motivations: any data file terminated by a special end of file character 
can be mapped onto an integer, so a prefix code for integers can be used 
as a self-delimiting encoding of files too. Large files correspond to large 
integers. Also, one of the building blocks of a ‘universal’ coding scheme — 
that is, a coding scheme that will work OK for a large variety of sources 
— is the ability to encode integers. Finally, in microwave ovens, cooking 
times are positive integers! 


Discuss criteria by which one might compare alternative codes for inte- 
gers (or, equivalently, alternative self-delimiting codes for files). 


> 6.9 Solutions 


Solution to exercise 6.1 (p.115). The worst-case situation is when the interval 
to be represented lies just inside a binary interval. In this case, we may choose 
either of two binary intervals as shown in figure 6.10. These binary intervals 
are no smaller than $P(\mathbf{x}\,|\,\mathcal{H})/4$, so the binary encoding has a length no greater
than $\log_2 1/P(\mathbf{x}\,|\,\mathcal{H}) + \log_2 4$, which is two bits more than the ideal message
length.

[Figure 6.9. Alternative keypads for microwave ovens.]

[Figure 6.10. Termination of arithmetic coding in the worst case, where there is a two bit overhead. Either of the two binary intervals marked on the right-hand side may be chosen. These binary intervals are no smaller than $P(\mathbf{x}\,|\,\mathcal{H})/4$. Panel labels: source string's interval; binary intervals.]


Solution to exercise 6.3 (p.118). The standard method uses 32 random bits 
per generated symbol and so requires 32000 bits to generate one thousand 
samples. 

Arithmetic coding uses on average about $H_2(0.01) = 0.081$ bits per generated
symbol, and so requires about 83 bits to generate one thousand samples
(assuming an overhead of roughly two bits associated with termination).

Fluctuations in the number of 1s would produce variations around this 
mean with standard deviation 21. 
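
The two figures quoted above can be checked with a few lines of arithmetic. This is a sketch under one reading of the solution: the mean is $1000\,H_2(0.01)$ plus two termination bits, and the fluctuation comes from the binomial spread in the number of 1s, each extra 1 costing $\log_2(0.99/0.01)$ more bits than a 0. That reading of the standard deviation is an assumption of this sketch, not something spelled out in the text.

```python
from math import log2, sqrt

def H2(p):
    """Binary entropy in bits."""
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

p1, n_samples = 0.01, 1000

mean_bits = n_samples * H2(p1) + 2          # two bits of termination overhead
std_ones = sqrt(n_samples * p1 * (1 - p1))  # binomial spread in the count of 1s
bits_per_extra_one = log2((1 - p1) / p1)    # extra cost of a 1 relative to a 0
std_bits = std_ones * bits_per_extra_one

print(round(mean_bits), round(std_bits))
```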


Solution to exercise 6.5 (p.120). The encoding is 010100110010110001100, 
which comes from the parsing 


0, 00, 000, 0000, 001, 00000, 000000 (6.23) 
which is encoded thus: 
(,0), (1,0), (10,0), (11,0), (010, 1), (100, 0), (110, 0). (6.24) 
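
The parsing and the (pointer, bit) pairs above can be reproduced with a short sketch of the basic Lempel-Ziv encoder. It assumes the conventions visible in (6.24): the dictionary starts with the empty string at index 0, each new phrase is a previously seen phrase plus one bit, and the pointer is written with just enough bits to address every phrase seen so far. The function name `lz_encode` is chosen here for illustration, and the source string is reconstructed from the parsing (6.23).

```python
def lz_encode(source):
    """Basic Lempel-Ziv: emit (pointer to longest known prefix, extra bit) pairs."""
    dictionary = {"": 0}  # phrase -> index; the empty string has index 0
    out = []
    phrase = ""
    for bit in source:
        if phrase + bit in dictionary:
            phrase += bit  # keep extending until the phrase is new
            continue
        # Pointer width: enough bits to index every phrase seen so far.
        width = (len(dictionary) - 1).bit_length()
        pointer = format(dictionary[phrase], "b").zfill(width) if width else ""
        out.append(pointer + bit)
        dictionary[phrase + bit] = len(dictionary)
        phrase = ""
    if phrase:
        # Handling of a trailing incomplete phrase is not covered by this sketch.
        raise ValueError("source ends in the middle of a phrase")
    return "".join(out)

source = "0" + "00" + "000" + "0000" + "001" + "00000" + "000000"  # parsing (6.23)
print(lz_encode(source))
```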


Solution to exercise 6.6 (p.120). The decoding is 
0100001000100010101000001. 


Solution to exercise 6.8 (p.123). This problem is equivalent to exercise 6.7 
(p. 123). 

The selection of K objects from N objects requires $\lceil \log_2 \binom{N}{K} \rceil$ bits $\simeq$
$N H_2(K/N)$ bits. This selection could be made using arithmetic coding. The
selection corresponds to a binary string of length N in which the 1 bits
represent which objects are selected. Initially the probability of a 1 is K/N and
the probability of a 0 is (N−K)/N. Thereafter, given that the emitted string
thus far, of length n, contains k 1s, the probability of a 1 is (K−k)/(N−n)
and the probability of a 0 is 1 − (K−k)/(N−n).
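
The adaptive probabilities just described can be checked numerically: summing $\log_2 1/p$ over the bits of any valid selection should give $\log_2 \binom{N}{K}$, the ideal message length an arithmetic coder would approach. The sketch below only accumulates that information content and does not implement the arithmetic coder itself; `selection_information` and the example string are hypothetical.

```python
from math import comb, log2

def selection_information(bits, K):
    """Total information content (bits) of a length-N string with K ones,
    using P(1) = (K - k) / (N - n) after n symbols containing k ones."""
    N = len(bits)
    k = 0
    total = 0.0
    for n, b in enumerate(bits):
        p_one = (K - k) / (N - n)
        p = p_one if b == 1 else 1 - p_one
        total += log2(1 / p)
        k += b
    return total

example = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]  # N = 10, K = 3 (illustrative choice)
print(selection_information(example, K=3), log2(comb(10, 3)))
```

Every string with exactly K ones receives the same total, which is why the model corresponds to a uniform distribution over the $\binom{N}{K}$ selections.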


Solution to exercise 6.12 (p.124). This modified Lempel-Ziv code is still not 
‘complete’, because, for example, after five prefixes have been collected, the 
pointer could be any of the strings 000, 001, 010, 011, 100, but it cannot be 
101, 110 or 111. Thus there are some binary strings that cannot be produced 
as encodings. 


Solution to exercise 6.13 (p.124). Sources with low entropy that are not well 
compressed by Lempel—Ziv include: 


(a) Sources with some symbols that have long range correlations and
intervening random junk. An ideal model should capture what’s correlated
and compress it. Lempel-Ziv can compress the correlated features only
by memorizing all cases of the intervening junk. As a simple example,
consider a telephone book in which every line contains an (old number,
new number) pair:

285-3820:572-5892□
258-8302:593-2010□

The number of characters per line is 18, drawn from the 13-character
alphabet {0,1,...,9,-,:,□}. The characters ‘-’, ‘:’ and ‘□’ occur in a
predictable sequence, so the true information content per line, assuming
all the phone numbers are seven digits long, and assuming that they are
random sequences, is about 14 bans. (A ban is the information content of
a random integer between 0 and 9.) A finite state language model could
easily capture the regularities in these data. A Lempel-Ziv algorithm
will take a long time before it compresses such a file down to 14 bans
per line, however, because in order for it to ‘learn’ that the string :ddd
is always followed by -, for any three digits ddd, it will have to see all
those strings. So near-optimal compression will only be achieved after
thousands of lines of the file have been read.

[Figure 6.11. A source with low entropy that is not well compressed by Lempel-Ziv. The bit sequence is read from left to right. Each line differs from the line above in f = 5% of its bits. The image width is 400 pixels.]


(b) Sources with long range correlations, for example two-dimensional
images that are represented by a sequence of pixels, row by row, so that
vertically adjacent pixels are a distance w apart in the source stream,
where w is the image width. Consider, for example, a fax transmission in
which each line is very similar to the previous line (figure 6.11). The true
entropy is only $H_2(f)$ per pixel, where f is the probability that a pixel
differs from its parent. Lempel-Ziv algorithms will only compress down
to the entropy once all strings of length $2^w = 2^{400}$ have occurred and
their successors have been memorized. There are only about $2^{300}$
particles in the universe, so we can confidently say that Lempel-Ziv codes
will never capture the redundancy of such an image.


Another highly redundant texture is shown in figure 6.12. The image was 
made by dropping horizontal and vertical pins randomly on the plane. It 
contains both long-range vertical correlations and long-range horizontal 
correlations. There is no practical way that Lempel—Ziv, fed with a 
pixel-by-pixel scan of this image, could capture both these correlations. 


Biological computational systems can readily identify the redundancy in 
these images and in images that are much more complex; thus we might 
anticipate that the best data compression algorithms will result from the 
development of artificial intelligence methods. 
Figure 6.12. A texture consisting of horizontal and vertical pins dropped at random on the plane.


(c) Sources with intricate redundancy, such as files generated by computers.
For example, a LaTeX file followed by its encoding into a PostScript
file. The information content of this pair of files is roughly equal to the
information content of the LaTeX file alone.


(d) A picture of the Mandelbrot set. The picture has an information content
equal to the number of bits required to specify the range of the complex 
plane studied, the pixel sizes, and the colouring rule used. 


(e) A picture of a ground state of a frustrated antiferromagnetic Ising model
(figure 6.13), which we will discuss in Chapter 31. Like figure 6.12, this 
binary image has interesting correlations in two directions. 





(f) Cellular automata: figure 6.14 shows the state history of 100 steps of
a cellular automaton with 400 cells. The update rule, in which each
cell’s new state depends on the state of five preceding cells, was selected
at random. The information content is equal to the information in the
boundary (400 bits), and the propagation rule, which here can be
described in 32 bits. An optimal compressor will thus give a compressed file
length which is essentially constant, independent of the vertical height of
the image. Lempel-Ziv would only give this zero-cost compression once
the cellular automaton has entered a periodic limit cycle, which could
easily take about $2^{100}$ iterations.

In contrast, the JBIG compression method, which models the probability 
of a pixel given its local context and uses arithmetic coding, would do a 
good job on these images. 


Solution to exercise 6.14 (p.124). For a one-dimensional Gaussian, the
variance of $x$, $E[x^2]$, is $\sigma^2$. So the mean value of $r^2$ in $N$ dimensions, since the
components of $\mathbf{x}$ are independent random variables, is

\[ E[r^2] = N\sigma^2. \tag{6.25} \]


Figure 6.13. Frustrated triangular 
Ising model in one of its ground 
states. 
Figure 6.14. The 100-step time-history of a cellular automaton with 400 cells.


The variance of $r^2$, similarly, is $N$ times the variance of $x^2$, where $x$ is a
one-dimensional Gaussian variable.

\[ \mathrm{var}(x^2) = \int \mathrm{d}x \, \frac{1}{(2\pi\sigma^2)^{1/2}} \, x^4 \exp\left(-\frac{x^2}{2\sigma^2}\right) - \sigma^4. \tag{6.26} \]

The integral is found to be $3\sigma^4$ (equation (6.14)), so $\mathrm{var}(x^2) = 2\sigma^4$. Thus the
variance of $r^2$ is $2N\sigma^4$.

For large $N$, the central-limit theorem indicates that $r^2$ has a Gaussian
distribution with mean $N\sigma^2$ and standard deviation $\sqrt{2N}\sigma^2$, so the probability
density of $r$ must similarly be concentrated about $r \simeq \sqrt{N}\sigma$.

The thickness of this shell is given by turning the standard deviation
of $r^2$ into a standard deviation on $r$: for small $\delta r/r$, $\delta \log r = \delta r/r =
(1/2)\,\delta \log r^2 = (1/2)\,\delta(r^2)/r^2$, so setting $\delta(r^2) = \sqrt{2N}\sigma^2$, $r$ has standard
deviation $\delta r = (1/2)\, r\, \delta(r^2)/r^2 = \sigma/\sqrt{2}$.

The probability density of the Gaussian at a point $\mathbf{x}_{\text{shell}}$ where $r = \sqrt{N}\sigma$
is

\[ P(\mathbf{x}_{\text{shell}}) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{N\sigma^2}{2\sigma^2}\right) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left(-\frac{N}{2}\right). \tag{6.27} \]

Whereas the probability density at the origin is

\[ P(\mathbf{x} = 0) = \frac{1}{(2\pi\sigma^2)^{N/2}}. \tag{6.28} \]

Thus $P(\mathbf{x}_{\text{shell}})/P(\mathbf{x} = 0) = \exp(-N/2)$. The probability density at the typical
radius is $e^{-N/2}$ times smaller than the density at the origin. If $N = 1000$, then
the probability density at the origin is $e^{500}$ times greater.
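
The shell picture can be checked by simulation. The sketch below draws points from an $N$-dimensional Gaussian and compares the sample statistics of $r$ with the predictions $E[r^2] = N\sigma^2$, $r \simeq \sqrt{N}\sigma$ and $\delta r = \sigma/\sqrt{2}$. The values of `N`, `sigma`, `n_samples` and the random seed are illustrative choices, not taken from the text.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch is repeatable
N, sigma, n_samples = 400, 1.0, 1000  # illustrative values

radii = []
for _ in range(n_samples):
    r_squared = sum(random.gauss(0.0, sigma) ** 2 for _ in range(N))
    radii.append(sqrt(r_squared))

mean_r2 = mean(r * r for r in radii)
print("mean of r^2:", mean_r2, "predicted:", N * sigma**2)
print("mean of r:  ", mean(radii), "predicted:", sqrt(N) * sigma)
print("std of r:   ", pstdev(radii), "predicted:", sigma / sqrt(2))
```

The thickness of the shell stays near $\sigma/\sqrt{2}$ however large $N$ is made, while its radius grows as $\sqrt{N}$.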
Codes for Integers


This chapter is an aside, which may safely be skipped. 


Solution to exercise 6.22 (p.127) 


To discuss the coding of integers we need some definitions. 


The standard binary representation of a positive integer n will be
denoted by $c_b(n)$, e.g., $c_b(5) = 101$, $c_b(45) = 101101$.


The standard binary length of a positive integer n, $l_b(n)$, is the
length of the string $c_b(n)$. For example, $l_b(5) = 3$, $l_b(45) = 6$.


The standard binary representation $c_b(n)$ is not a uniquely decodeable code
for integers since there is no way of knowing when an integer has ended. For
example, $c_b(5)c_b(5)$ is identical to $c_b(45)$. It would be uniquely decodeable if
we knew the standard binary length of each integer before it was received.

Noticing that all positive integers have a standard binary representation
that starts with a 1, we might define another representation:


The headless binary representation of a positive integer n will be
denoted by $c_B(n)$, e.g., $c_B(5) = 01$, $c_B(45) = 01101$ and $c_B(1) = \lambda$ (where
$\lambda$ denotes the null string).


This representation would be uniquely decodeable if we knew the length $l_b(n)$
of the integer.

So, how can we make a uniquely decodeable code for integers? Two strate- 
gies can be distinguished. 


1. Self-delimiting codes. We first communicate somehow the length of
the integer, $l_b(n)$, which is also a positive integer; then communicate the
original integer n itself using $c_B(n)$.


2. Codes with ‘end of file’ characters. We code the integer into blocks
of length b bits, and reserve one of the $2^b$ symbols to have the special
meaning ‘end of file’. The coding of integers into blocks is arranged so
that this reserved symbol is not needed for any other purpose.


The simplest uniquely decodeable code for integers is the unary code, which 
can be viewed as a code with an end of file character. 
Unary code. An integer n is encoded by sending a string of n−1 0s followed
by a 1.

n     c_U(n)
1     1
2     01
3     001
4     0001
5     00001
⋮
45    000000000000000000000000000000000000000000001

The unary code has length $l_U(n) = n$.

The unary code is the optimal code for integers if the probability
distribution over n is $p_U(n) = 2^{-n}$.





Self-delimiting codes

We can use the unary code to encode the length of the binary encoding of n
and make a self-delimiting code:

Code $C_\alpha$. We send the unary code for $l_b(n)$, followed by the headless binary
representation of n.

\[ c_\alpha(n) = c_U[l_b(n)]\, c_B(n). \tag{7.1} \]

Table 7.1 shows the codes for some integers. The overlining indicates
the division of each string into the parts $c_U[l_b(n)]$ and $c_B(n)$. We might
equivalently view $c_\alpha(n)$ as consisting of a string of $(l_b(n) - 1)$ zeroes
followed by the standard binary representation of n, $c_b(n)$.

The codeword $c_\alpha(n)$ has length $l_\alpha(n) = 2 l_b(n) - 1$.

Table 7.1. $C_\alpha$.

n     c_b(n)    l_b(n)   c_α(n)
1     1         1        1
2     10        2        010
3     11        2        011
4     100       3        00100
5     101       3        00101
6     110       3        00110
⋮
45    101101    6        00000101101

The implicit probability distribution over n for the code $C_\alpha$ is separable
into the product of a probability distribution over the length $l$,

\[ P(l) = 2^{-l}, \tag{7.2} \]

and a uniform distribution over integers having that length,

\[ P(n \,|\, l) = \begin{cases} 2^{-l+1} & l_b(n) = l \\ 0 & \text{otherwise.} \end{cases} \tag{7.3} \]

Now, for the above code, the header that communicates the length always
occupies the same number of bits as the standard binary representation of
the integer (give or take one). If we are expecting to encounter large integers
(large files) then this representation seems suboptimal, since it leads to all files
occupying a size that is double their original uncoded size. Instead of using
the unary code to encode the length $l_b(n)$, we could use $C_\alpha$.

Table 7.2. $C_\beta$ and $C_\gamma$.

n     c_β(n)        c_γ(n)
1     1             1
2     0100          01000
3     0101          01001
4     01100         010100
5     01101         010101
6     01110         010110
⋮
45    0011001101    0111001101





Code $C_\beta$. We send the length $l_b(n)$ using $C_\alpha$, followed by the headless binary
representation of n.

\[ c_\beta(n) = c_\alpha[l_b(n)]\, c_B(n). \tag{7.4} \]

Iterating this procedure, we can define a sequence of codes.

Code $C_\gamma$.

\[ c_\gamma(n) = c_\beta[l_b(n)]\, c_B(n). \tag{7.5} \]

Code $C_\delta$.

\[ c_\delta(n) = c_\gamma[l_b(n)]\, c_B(n). \tag{7.6} \]
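
The definitions (7.1) and (7.4)-(7.6) translate almost directly into code. The sketch below represents codewords as Python strings of '0' and '1'; the function names mirror the symbols in the text and are otherwise an assumption of this illustration.

```python
def c_b(n):
    """Standard binary representation of a positive integer."""
    return format(n, "b")

def l_b(n):
    """Standard binary length."""
    return len(c_b(n))

def c_B(n):
    """Headless binary representation: c_b(n) without its leading 1."""
    return c_b(n)[1:]

def c_U(n):
    """Unary code: n - 1 zeroes followed by a one."""
    return "0" * (n - 1) + "1"

def c_alpha(n):
    return c_U(l_b(n)) + c_B(n)      # equation (7.1)

def c_beta(n):
    return c_alpha(l_b(n)) + c_B(n)  # equation (7.4)

def c_gamma(n):
    return c_beta(l_b(n)) + c_B(n)   # equation (7.5)

def c_delta(n):
    return c_gamma(l_b(n)) + c_B(n)  # equation (7.6)

for n in (1, 2, 3, 4, 5, 6, 45):
    print(n, c_alpha(n), c_beta(n), c_gamma(n))
```

Each code in the sequence spends fewer bits on the length header for large n, at the price of slightly longer codewords for some small integers.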
Codes with end-of-file symbols


We can also make byte-based representations. (Let’s use the term byte flexibly 
here, to denote any fixed-length string of bits, not just a string of length 8 
bits.) If we encode the number in some base, for example decimal, then we 
can represent each digit in a byte. In order to represent a digit from 0 to 9 in a 
byte we need four bits. Because $2^4 = 16$, this leaves 6 extra four-bit symbols,
{1010, 1011, 1100, 1101, 1110, 1111}, that correspond to no decimal digit. 
We can use these as end-of-file symbols to indicate the end of our positive 
integer. 


Clearly it is redundant to have more than one end-of-file symbol, so a more
efficient code would encode the integer into base 15, and use just the sixteenth
symbol, 1111, as the punctuation character. Generalizing this idea, we can
make similar byte-based codes for integers in bases 3 and 7, and in any base
of the form $2^n - 1$.


These codes are almost complete. (Recall that a code is ‘complete’ if it
satisfies the Kraft inequality with equality.) The codes’ remaining inefficiency
is that they provide the ability to encode the integer zero and the empty string,
neither of which was required.


> Exercise 7.1.[2, p.136] Consider the implicit probability distribution over
integers corresponding to the code with an end-of-file character.


(a) If the code has eight-bit blocks (i.e., the integer is coded in base 
255), what is the mean length in bits of the integer, under the 
implicit distribution? 

(b) If one wishes to encode binary files of expected size about one hun- 
dred kilobytes using a code with an end-of-file character, what is 
the optimal block size? 


Encoding a tiny file 


To illustrate the codes we have discussed, we now use each code to encode a
small file consisting of just 14 characters,

Claude Shannon.

• If we map the ASCII characters onto seven-bit symbols (e.g., in decimal,
C = 67, l = 108, etc.), this 14 character file corresponds to the integer

n = 167 987 786 364 950 891 085 602 469 870 (decimal).

• The unary code for n consists of this many (less one) zeroes, followed by
a one. If all the oceans were turned into ink, and if we wrote a hundred
bits with every cubic millimeter, there might be enough ink to write
$c_U(n)$.

• The standard binary representation of n is this length-98 sequence of
bits:

c_b(n) = 1000011110110011000011110101110010011001010100000
         1010011110100011000011101110110111011011111101110.
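
The mapping from the file to the integer can be sketched in a few lines, assuming (as the example above suggests) that the seven-bit ASCII codes are simply concatenated, most significant character first. Running it should reproduce the 98-bit string quoted above and print the corresponding decimal integer.

```python
text = "Claude Shannon"

# Seven bits per ASCII character, concatenated in reading order.
bits = "".join(format(ord(ch), "07b") for ch in text)
n = int(bits, 2)

print(len(bits))  # length of the standard binary representation
print(bits)
print(n)
```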


> Exercise 7.2.[2] Write down or describe the following self-delimiting
representations of the above number n: $c_\alpha(n)$, $c_\beta(n)$, $c_\gamma(n)$, $c_\delta(n)$, $c_3(n)$, $c_7(n)$,
and $c_{15}(n)$. Which of these encodings is the shortest? [Answer: $c_{15}$.]


n     c_3(n)            c_7(n)
1     01 11             001 111
2     10 11             010 111
3     01 00 11          011 111
⋮
45    01 10 00 00 11    110 011 111

Table 7.3. Two codes with end-of-file symbols, $C_3$ and $C_7$.
Spaces have been included to show the byte boundaries.
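
The byte-based codes of table 7.3 can be generated by the following sketch. It assumes the convention used in the table: the integer is written in base $2^b - 1$, each digit occupies one $b$-bit block, and the all-ones block serves as the end-of-file symbol. The function name `c_eof` and its `block_bits` parameter are hypothetical.

```python
def c_eof(n, block_bits):
    """Encode n >= 1 in base 2**block_bits - 1, one digit per block,
    followed by the all-ones end-of-file block. Blocks are space-separated."""
    base = 2 ** block_bits - 1
    digits = []
    while n > 0:
        n, d = divmod(n, base)
        digits.append(format(d, "b").zfill(block_bits))
    blocks = digits[::-1] + ["1" * block_bits]  # most significant digit first
    return " ".join(blocks)

for n in (1, 2, 3, 45):
    print(n, c_eof(n, 2), "|", c_eof(n, 3))
```

Calling `c_eof(n, 4)` gives the base-15 code $c_{15}(n)$ mentioned earlier.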
Comparing the codes


One could answer the question ‘which of two codes is superior?’ by a sentence 
of the form ‘For n > k, code 1 is superior, for n < k, code 2 is superior’ but I 
contend that such an answer misses the point: any complete code corresponds 
to a prior for which it is optimal; you should not say that any other code is 
superior to it. Other codes are optimal for other priors. These implicit priors 
should be thought about so as to achieve the best code for one’s application. 


Notice that one cannot, for free, switch from one code to another, choosing 
whichever is shorter. If one were to do this, then it would be necessary to 
lengthen the message in some way that indicates which of the two codes is 
being used. If this is done by a single leading bit, it will be found that the 
resulting code is suboptimal because it fails the Kraft equality, as was discussed 
in exercise 5.33 (p.104). 


Another way to compare codes for integers is to consider a sequence of 
probability distributions, such as monotonic probability distributions over $n \geq 1$,
and rank the codes as to how well they encode any of these distributions.
A code is called a ‘universal’ code if for any distribution in a given class, it 
encodes into an average length that is within some factor of the ideal average 
length. 


Let me say this again. We are meeting an alternative world view — rather 
than figuring out a good prior over integers, as advocated above, many the- 
orists have studied the problem of creating codes that are reasonably good 
codes for any priors in a broad class. Here the class of priors convention- 
ally considered is the set of priors that (a) assign a monotonically decreasing 
probability over integers and (b) have finite entropy. 


Several of the codes we have discussed above are universal. Another code 
which elegantly transcends the sequence of self-delimiting codes is Elias’s ‘uni- 
versal code for integers’ (Elias, 1975), which effectively chooses from all the 
codes $C_\alpha$, $C_\beta$, .... It works by sending a sequence of messages each of which
encodes the length of the next message, and indicates by a single bit whether 
or not that message is the final integer (in its standard binary representation). 
Because a length is a positive integer and all positive integers begin with ‘1’, 
all the leading 1s can be omitted. 


Algorithm 7.4. Elias’s encoder for an integer n.

Write ‘0’
Loop {
    If ⌊log n⌋ = 0 halt
    Prepend c_b(n) to the written string
    n := ⌊log n⌋
}





The encoder of $C_\omega$ is shown in algorithm 7.4. The encoding is generated
from right to left. Table 7.5 shows the resulting codewords.
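
Algorithm 7.4 can be transcribed into a runnable sketch as follows, taking the logarithm to base 2 so that $\lfloor \log n \rfloor$ is one less than the binary length of n. The decoder is not given in the text; the one below is an assumption of this sketch, written to invert the encoder by reading each group as the length of the next. The names `elias_encode` and `elias_decode` are hypothetical.

```python
def elias_encode(n):
    """Elias's code for a positive integer, built from right to left."""
    code = "0"
    while n.bit_length() - 1 != 0:   # floor(log2 n) != 0
        code = format(n, "b") + code # prepend c_b(n)
        n = n.bit_length() - 1       # n := floor(log2 n)
    return code

def elias_decode(bits):
    """Return (integer, number of bits consumed) from the front of a bit string."""
    n, pos = 1, 0
    while bits[pos] == "1":
        group = bits[pos:pos + n + 1]  # the next group has n + 1 bits
        pos += n + 1
        n = int(group, 2)
    return n, pos + 1                  # the final '0' terminates the codeword

for n in (1, 2, 8, 16, 45, 1025):
    print(n, elias_encode(n))
```

Because the decoder knows where each codeword ends, codewords can be concatenated without separators.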


> Exercise 7.3.[2] Show that the Elias code is not actually the best code for a
prior distribution that expects very large integers. (Do this by
constructing another code and specifying how large n must be for your code to
give a shorter length than Elias’s.)
Table 7.5. Elias’s ‘universal’ code for integers. Examples from 1 to 1025.

n       c_ω(n)
1       0
2       100
3       110
4       101000
5       101010
6       101100
7       101110
8       1110000
9       1110010
10      1110100
11      1110110
12      1111000
13      1111010
14      1111100
15      1111110
16      10100100000
31      10100111110
32      101011000000
45      101011011010
63      101011111110
64      1011010000000
127     1011011111110
128     10111100000000
255     10111111111110
256     1110001000000000
365     1110001011011010
511     1110001111111110
512     11100110000000000
719     11100110110011110
1023    11100111111111110
1024    111010100000000000
1025    111010100000000010


Solutions


Solution to exercise 7.1 (p.134). The use of the end-of-file symbol in a code
that represents the integer in some base q corresponds to a belief that there is
a probability of (1/(q + 1)) that the current character is the last character of
the number. Thus the prior to which this code is matched puts an exponential
prior distribution over the length of the integer.


(a) The expected number of characters is q+1 = 256, so the expected length
of the integer is 256 × 8 ≃ 2000 bits.


(b) We wish to find q such that $q \log q \simeq 800\,000$ bits. A value of q between
$2^{15}$ and $2^{16}$ satisfies this constraint, so 16-bit blocks are roughly the
optimal size, assuming there is one end-of-file character.
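
The arithmetic in the solution to exercise 7.1 can be reproduced with a short sketch. It takes the expected length under the implicit prior to be about $2^b$ blocks of $b$ bits each, and interprets "one hundred kilobytes" as 800 000 bits, matching the figure used in part (b). The function name `expected_bits` and the search range are choices made here.

```python
def expected_bits(block_bits):
    """Approximate mean encoded length: 2**block_bits blocks of block_bits bits."""
    return (2 ** block_bits) * block_bits

target_bits = 100 * 1000 * 8  # one hundred kilobytes, with 1 kilobyte = 1000 bytes

# Block size whose expected length is closest to the target.
best = min(range(1, 33), key=lambda b: abs(expected_bits(b) - target_bits))

print(expected_bits(8))   # part (a): eight-bit blocks
print(expected_bits(15), expected_bits(16), best)
```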
Dependent Random Variables


In the last three chapters on data compression we concentrated on random 
vectors x coming from an extremely simple probability distribution, namely 
the separable distribution in which each component $x_n$ is independent of the
others. 

In this chapter, we consider joint ensembles in which the random variables 
are dependent. This material has two motivations. First, data from the real 
world have interesting correlations, so to do data compression well, we need 
to know how to work with models that include dependences. Second, a noisy 
channel with input x and output y defines a joint ensemble in which x and y are
dependent — if they were independent, it would be impossible to communicate 
over the channel — so communication over noisy channels (the topic of chapters 
9-11) is described in terms of the entropy of joint ensembles. 


> 8.1 More about entropy 
This section gives definitions and exercises to do with entropy, carrying on 
from section 2.4. 


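The joint entropy of X, Y is:

\[ H(X,Y) = \sum_{xy \in \mathcal{A}_X \mathcal{A}_Y} P(x,y) \log \frac{1}{P(x,y)}. \tag{8.1} \]

Entropy is additive for independent random variables:

\[ H(X,Y) = H(X) + H(Y) \ \text{ iff } \ P(x,y) = P(x)P(y). \tag{8.2} \]

The conditional entropy of X given $y = b_k$ is the entropy of the
probability distribution $P(x \,|\, y = b_k)$.

\[ H(X \,|\, y = b_k) = \sum_{x \in \mathcal{A}_X} P(x \,|\, y = b_k) \log \frac{1}{P(x \,|\, y = b_k)}. \tag{8.3} \]

The conditional entropy of X given Y is the average, over y, of the
conditional entropy of X given y.

\[ H(X \,|\, Y) = \sum_{y \in \mathcal{A}_Y} P(y) \left[ \sum_{x \in \mathcal{A}_X} P(x \,|\, y) \log \frac{1}{P(x \,|\, y)} \right] = \sum_{xy \in \mathcal{A}_X \mathcal{A}_Y} P(x,y) \log \frac{1}{P(x \,|\, y)}. \tag{8.4} \]

This measures the average uncertainty that remains about x when y is
known.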
The marginal entropy of X is another name for the entropy of X, H(X),
used to contrast it with the conditional entropies listed above. 


Chain rule for information content. From the product rule for
probabilities, equation (2.6), we obtain:

\[ \log \frac{1}{P(x,y)} = \log \frac{1}{P(x)} + \log \frac{1}{P(y \,|\, x)} \tag{8.5} \]

so

\[ h(x,y) = h(x) + h(y \,|\, x). \tag{8.6} \]

In words, this says that the information content of x and y is the
information content of x plus the information content of y given x.


Chain rule for entropy. The joint entropy, conditional entropy and 
marginal entropy are related by: 


H(X,Y) = H(X) + H(Y |X) = H(Y) + H(X |Y). (8.7) 


In words, this says that the uncertainty of X and Y is the uncertainty 
of X plus the uncertainty of Y given X. 


The mutual information between X and Y is

\[ I(X;Y) = H(X) - H(X \,|\, Y), \tag{8.8} \]

and satisfies $I(X;Y) = I(Y;X)$, and $I(X;Y) \geq 0$. It measures the
average reduction in uncertainty about x that results from learning the
value of y; or vice versa, the average amount of information that x
conveys about y.


The conditional mutual information between X and Y given $z = c_k$
is the mutual information between the random variables X and Y in
the joint ensemble $P(x,y \,|\, z = c_k)$,

\[ I(X;Y \,|\, z = c_k) = H(X \,|\, z = c_k) - H(X \,|\, Y, z = c_k). \tag{8.9} \]


The conditional mutual information between X and Y given Z is 
the average over z of the above conditional mutual information. 


I(X;Y |Z) = H(X | Z) — H(X |Y, Z). (8.10) 


No other ‘three-term entropies’ will be defined. For example, expres- 
sions such as I(X;Y; Z) and I(X |Y; Z) are illegal. But you may put 
conjunctions of arbitrary numbers of variables in each of the three spots 
in the expression I(X;Y |Z) — for example, I(A, B; C, D | E, F) is fine: 
it measures how much information on average c and d convey about a 
and b, assuming e and f are known. 


Figure 8.1 shows how the total entropy H(X,Y) of a joint ensemble can be
broken down. This figure is important.

[Figure 8.1. The relationship between joint information, marginal entropy, conditional entropy and mutual entropy. Labels recovered from the figure: H(X,Y); H(X); H(X|Y), I(X;Y), H(Y|X).]


> 8.2 Exercises





> Exercise 8.1.[1] Consider three independent random variables u, v, w with
entropies $H_u$, $H_v$, $H_w$. Let $X \equiv (U,V)$ and $Y \equiv (V,W)$. What is H(X,Y)?
What is H(X|Y)? What is I(X;Y)?

> Exercise 8.2.[3, p.142] Referring to the definitions of conditional entropy
(8.3-8.4), confirm (with an example) that it is possible for $H(X \,|\, y = b_k)$ to
exceed H(X), but that the average, H(X|Y), is less than H(X). So
data are helpful: they do not increase uncertainty, on average.


> Exercise 8.3.[2, p.143] Prove the chain rule for entropy, equation (8.7).
[H(X,Y) = H(X) + H(Y|X)].


Exercise 8.4.[2, p.143] Prove that the mutual information $I(X;Y) = H(X) - H(X \,|\, Y)$
satisfies $I(X;Y) = I(Y;X)$ and $I(X;Y) \geq 0$.


[Hint: see exercise 2.26 (p.37) and note that

\[ I(X;Y) = D_{\mathrm{KL}}(P(x,y) \,\|\, P(x)P(y)).] \tag{8.11} \]


Exercise 8.5.[4] The ‘entropy distance’ between two random variables can be
defined to be the difference between their joint entropy and their mutual
information:

\[ D_H(X,Y) = H(X,Y) - I(X;Y). \tag{8.12} \]

Prove that the entropy distance satisfies the axioms for a distance:
$D_H(X,Y) \geq 0$, $D_H(X,X) = 0$, $D_H(X,Y) = D_H(Y,X)$, and $D_H(X,Z) \leq
D_H(X,Y) + D_H(Y,Z)$. [Incidentally, we are unlikely to see $D_H(X,Y)$
again but it is a good function on which to practise inequality-proving.]





Exercise 8.6.[2, p.147] A joint ensemble XY has the following joint distribution.

P(x,y)            x
            1       2       3       4
y    1     1/8     1/16    1/32    1/32
     2     1/16    1/8     1/32    1/32
     3     1/16    1/16    1/16    1/16
     4     1/4     0       0       0





What is the joint entropy H(X,Y)? What are the marginal entropies 
H(X) and H(Y)? For each value of y, what is the conditional entropy 
H(X |y)? What is the conditional entropy H(X |Y)? What is the 
conditional entropy of Y given X? What is the mutual information 
between X and Y? 
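
The definitions of section 8.1 can be evaluated mechanically for any joint distribution, which is a useful way to check hand calculations. The sketch below encodes the table of exercise 8.6 (rows indexed by y, columns by x) and computes each requested quantity; the helper `entropy` and the variable names are introduced here for illustration, and no results are quoted so that the exercise can still be attempted by hand first.

```python
from fractions import Fraction as F
from math import log2

# Joint distribution P(x, y): rows are y = 1..4, columns are x = 1..4.
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def entropy(probs):
    """Entropy in bits of a collection of probabilities (zeros are skipped)."""
    return sum(float(p) * log2(1 / p) for p in probs if p > 0)

P_y = [sum(row) for row in P]        # marginal over x
P_x = [sum(col) for col in zip(*P)]  # marginal over y

H_XY = entropy(p for row in P for p in row)
H_X = entropy(P_x)
H_Y = entropy(P_y)
H_X_given_y = [entropy(p / py for p in row) for row, py in zip(P, P_y)]  # (8.3)
H_X_given_Y = sum(float(py) * h for py, h in zip(P_y, H_X_given_y))      # (8.4)
H_Y_given_X = H_XY - H_X                                                 # chain rule (8.7)
I_XY = H_X - H_X_given_Y                                                 # (8.8)

print(H_XY, H_X, H_Y)
print(H_X_given_y, H_X_given_Y, H_Y_given_X, I_XY)
```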


Exercise 8.7.[2, p.143] Consider the ensemble XYZ in which $\mathcal{A}_X = \mathcal{A}_Y =
\mathcal{A}_Z = \{0,1\}$, x and y are independent with $\mathcal{P}_X = \{p, 1-p\}$ and
$\mathcal{P}_Y = \{q, 1-q\}$ and

\[ z = (x + y) \bmod 2. \tag{8.13} \]

(a) If q = 1/2, what is $\mathcal{P}_Z$? What is I(Z;X)?

(b) For general p and q, what is $\mathcal{P}_Z$? What is I(Z;X)? Notice that
this ensemble is related to the binary symmetric channel, with x =
input, y = noise, and z = output.







Three term entropies 


Exercise 8.8.[3, p.143] Many texts draw figure 8.1 in the form of a Venn diagram
(figure 8.2). Discuss why this diagram is a misleading representation
of entropies. Hint: consider the three-variable ensemble XYZ in which
$x \in \{0,1\}$ and $y \in \{0,1\}$ are independent binary variables and $z \in \{0,1\}$
is defined to be $z = x + y \bmod 2$.


> 8.3 Further exercises 


The data-processing theorem 

The data processing theorem states that data processing can only destroy
information.

Exercise 8.9.[3, p.144] Prove this theorem by considering an ensemble WDR
in which w is the state of the world, d is data gathered, and r is the
processed data, so that these three variables form a Markov chain

\[ w \rightarrow d \rightarrow r, \tag{8.14} \]
that is, the probability P(w, d,r) can be written as 
P(w,d,r) = P(w)P(d|w)P(r |d). (8.15) 


Show that the average information that R conveys about W, I(W; R), is 
less than or equal to the average information that D conveys about W, 
I(W; D). 


This theorem is as much a caution about our definition of ‘information’ as it 
is a caution about data processing! 
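
A numerical instance does not prove the theorem, but it shows what the inequality $I(W;R) \leq I(W;D)$ looks like in practice. The sketch below builds a Markov chain of the form (8.15) from two binary symmetric channels; the uniform prior on w and the flip probabilities 0.1 and 0.2 are hypothetical choices, as are the helper names `bsc` and `mutual_information`.

```python
from itertools import product
from math import log2

def bsc(flip):
    """P(out | in) for a binary symmetric channel, keyed by (out, in)."""
    return {(o, i): (flip if o != i else 1 - flip)
            for o, i in product((0, 1), repeat=2)}

def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): P(a, b)}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

P_w = {0: 0.5, 1: 0.5}    # state of the world (assumed uniform)
P_d_given_w = bsc(0.1)    # data gathering, hypothetical noise level
P_r_given_d = bsc(0.2)    # processing, hypothetical noise level

joint_wd, joint_wr = {}, {}
for w, d, r in product((0, 1), repeat=3):
    p = P_w[w] * P_d_given_w[(d, w)] * P_r_given_d[(r, d)]  # equation (8.15)
    joint_wd[(w, d)] = joint_wd.get((w, d), 0.0) + p
    joint_wr[(w, r)] = joint_wr.get((w, r), 0.0) + p

I_wd = mutual_information(joint_wd)
I_wr = mutual_information(joint_wr)
print(I_wd, I_wr)
```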


Figure 8.2. A misleading 
representation of entropies 
(contrast with figure 8.1). 
Inference and information measures


Exercise 8.10.[2] The three cards.

(a) One card is white on both faces; one is black on both faces; and one
is white on one side and black on the other. The three cards are
shuffled and their orientations randomized. One card is drawn and
placed on the table. The upper face is black. What is the colour of
its lower face? (Solve the inference problem.)

(b) Does seeing the top face convey information about the colour of
the bottom face? Discuss the information contents and entropies
in this situation. Let the value of the upper face’s colour be u and
the value of the lower face’s colour be l. Imagine that we draw
a random card and learn both u and l. What is the entropy of
u, H(U)? What is the entropy of l, H(L)? What is the mutual
information between U and L, I(U;L)?


Entropies of Markov processes


Exercise 8.11. In the guessing game, we imagined predicting the next letter
in a document starting from the beginning and working towards the end.
Consider the task of predicting the reversed text, that is, predicting the
letter that precedes those already known. Most people find this a harder
task. Assuming that we model the language using an N-gram model
(which says the probability of the next character depends only on the
N − 1 preceding characters), is there any difference between the average
information contents of the reversed language and the forward language?


> 8.4 Solutions 


Solution to exercise 8.2 (p.140). See exercise 8.6 (p.140) for an example where 
H(X |y) exceeds H(X) (set y=3). 


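We can prove the inequality H(X|Y) ≤ H(X) by turning the expression
into a relative entropy (using Bayes' theorem) and invoking Gibbs' inequality
(exercise 2.26 (p.37)):

\begin{align}
H(X \mid Y) &\equiv \sum_{y} P(y) \sum_{x} P(x \mid y) \log \frac{1}{P(x \mid y)} \notag \\
&= \sum_{xy} P(x,y) \log \frac{1}{P(x \mid y)} \tag{8.16} \\
&= \sum_{xy} P(x) P(y \mid x) \log \frac{P(y)}{P(y \mid x) P(x)} \tag{8.17} \\
&= \sum_{x} P(x) \log \frac{1}{P(x)} + \sum_{x} P(x) \sum_{y} P(y \mid x) \log \frac{P(y)}{P(y \mid x)}. \tag{8.18}
\end{align}

The first term of the last expression is H(X). The second term is minus a
weighted sum of relative entropies between the distributions P(y|x) and
P(y), so by Gibbs' inequality it is at most zero. So

\[ H(X \mid Y) \le H(X) + 0, \tag{8.19} \]

with equality only if P(y|x) = P(y) for all x and y (that is, only if X and Y
are independent).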




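Solution to exercise 8.3 (p.140). The chain rule for entropy follows from the
decomposition of a joint probability:

\begin{align}
H(X,Y) &= \sum_{xy} P(x,y) \log \frac{1}{P(x,y)} \tag{8.20} \\
&= \sum_{xy} P(x) P(y \mid x) \left[ \log \frac{1}{P(x)} + \log \frac{1}{P(y \mid x)} \right] \tag{8.21} \\
&= \sum_{x} P(x) \log \frac{1}{P(x)} + \sum_{x} P(x) \sum_{y} P(y \mid x) \log \frac{1}{P(y \mid x)} \tag{8.22} \\
&= H(X) + H(Y \mid X). \tag{8.23}
\end{align}
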
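Solution to exercise 8.4 (p.140). Symmetry of mutual information:

\begin{align}
I(X;Y) &= H(X) - H(X \mid Y) \tag{8.24} \\
&= \sum_{x} P(x) \log \frac{1}{P(x)} - \sum_{xy} P(x,y) \log \frac{1}{P(x \mid y)} \tag{8.25} \\
&= \sum_{xy} P(x,y) \log \frac{P(x \mid y)}{P(x)} \tag{8.26} \\
&= \sum_{xy} P(x,y) \log \frac{P(x,y)}{P(x) P(y)}. \tag{8.27}
\end{align}

This expression is symmetric in x and y so

\[ I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \tag{8.28} \]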


We can prove that mutual information is non-negative in two ways. One is to
continue from

\[ I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x) P(y)}, \tag{8.29} \]

which is a relative entropy, and use Gibbs' inequality (proved on p.44), which
asserts that this relative entropy is ≥ 0, with equality only if P(x,y) =
P(x)P(y), that is, if X and Y are independent.

The other is to use Jensen's inequality on

\[ -\sum_{x,y} P(x,y) \log \frac{P(x) P(y)}{P(x,y)} \ge -\log \sum_{x,y} P(x,y) \frac{P(x) P(y)}{P(x,y)} = \log 1 = 0. \tag{8.30} \]


Solution to exercise 8.7 (p.141). z = x + y mod 2.

(a) If q = 1/2, P_Z = {1/2, 1/2} and I(Z;X) = H(Z) − H(Z|X) = 1 − 1 = 0.

(b) For general q and p, P_Z = {pq + (1−p)(1−q), p(1−q) + q(1−p)}. The mutual
information is I(Z;X) = H(Z) − H(Z|X) = H_2(pq + (1−p)(1−q)) − H_2(q).


Three term entropies 


Solution to exercise 8.8 (p.141). The depiction of entropies in terms of Venn 
diagrams is misleading for at least two reasons. 

First, one is used to thinking of Venn diagrams as depicting sets; but what 
are the ‘sets’ H(X) and H(Y) depicted in figure 8.2, and what are the objects 
that are members of those sets? I think this diagram encourages the novice 
student to make inappropriate analogies. For example, some students imagine
that the random outcome (x,y) might correspond to a point in the diagram,
and thus confuse entropies with probabilities.

Figure 8.3. A misleading representation of entropies, continued.

Secondly, the depiction in terms of Venn diagrams encourages one to be- 
lieve that all the areas correspond to positive quantities. In the special case of 
two random variables it is indeed true that H(X |Y), I(X;Y) and H(Y |X) 
are positive quantities. But as soon as we progress to three-variable ensembles, 
we obtain a diagram with positive-looking areas that may actually correspond 
to negative quantities. Figure 8.3 correctly shows relationships such as 


H(X) + H(Z|X)+H(Y|X,Z) = H(X,Y, Z). (8.31) 





But it gives the misleading impression that the conditional mutual information
I(X;Y|Z) is less than the mutual information I(X;Y). In fact the area
labelled A can correspond to a negative quantity. Consider the joint ensemble
(X,Y,Z) in which x ∈ {0,1} and y ∈ {0,1} are independent binary variables
and z ∈ {0,1} is defined to be z = x + y mod 2. Then clearly H(X) =
H(Y) = 1 bit. Also H(Z) = 1 bit. And H(Y|X) = H(Y) = 1 since the two
variables are independent. So the mutual information between X and Y is
zero: I(X;Y) = 0. However, if z is observed, X and Y become dependent:
knowing x, given z, tells you what y is, namely y = z − x mod 2. So I(X;Y|Z) = 1
bit. Thus the area labelled A must correspond to −1 bits for the figure to give
the correct answers.

The above example is not at all a capricious or exceptional illustration. The 
binary symmetric channel with input X, noise Y, and output Z is a situation 
in which I(X;Y) = 0 (input and noise are independent) but I(X;Y |Z) > 0 
(once you see the output, the unknown input and the unknown noise are 
intimately related!). 
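
The three-variable example can be checked numerically. The following is a minimal sketch, not part of the original solution: the helper names `entropy`, `marginal` and `H` are illustrative choices, and the region labelled A is taken to be I(X;Y) − I(X;Y|Z), as the argument above implies.

```python
from itertools import product
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)

def marginal(joint, keep):
    """Marginalize a joint dict over tuples, keeping the listed positions."""
    out = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# Joint P(x, y, z): x and y are independent fair bits, z = (x + y) mod 2.
joint = {(x, y, (x + y) % 2): 0.25 for x, y in product((0, 1), repeat=2)}

def H(*keep):
    """Entropy of the marginal over the listed variable positions."""
    return entropy(marginal(joint, keep))

X, Y, Z = 0, 1, 2
mi_xy = H(X) + H(Y) - H(X, Y)                          # I(X;Y)
mi_xy_given_z = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)  # I(X;Y|Z)
area_a = mi_xy - mi_xy_given_z                         # region labelled A

print(mi_xy, mi_xy_given_z, area_a)  # 0.0 1.0 -1.0
```

The conditional mutual information is computed purely from joint entropies, so the same three lines work for any other three-variable ensemble placed in `joint`.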

The Venn diagram representation is therefore valid only if one is aware 
that positive areas may represent negative quantities. With this proviso kept 
in mind, the interpretation of entropies in terms of sets can be helpful (Yeung, 
1991). 


Solution to exercise 8.9 (p.141). For any joint ensemble XYZ, the following
chain rule for mutual information holds.

\[ I(X;Y,Z) = I(X;Y) + I(X;Z \mid Y). \tag{8.32} \]

Now, in the case w → d → r, w and r are independent given d, so
I(W;R|D) = 0. Using the chain rule twice, we have:

\[ I(W;D,R) = I(W;D) \tag{8.33} \]

and

\[ I(W;D,R) = I(W;R) + I(W;D \mid R), \tag{8.34} \]

so, because the conditional mutual information I(W;D|R) is non-negative,

\[ I(W;R) - I(W;D) \le 0. \tag{8.35} \]
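
As an informal sanity check of the inequality (not a proof), one can draw random Markov chains w → d → r and compare I(W;R) with I(W;D). This is a sketch under assumptions: the alphabet size of 3, the seed, and the function names are arbitrary illustrative choices.

```python
import random
from math import log2

def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

def random_distribution(n, rng):
    """A random probability vector of length n with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]

def check_once(rng, n=3):
    """Draw a random chain w -> d -> r and return (I(W;D), I(W;R))."""
    p_w = random_distribution(n, rng)
    p_d_given_w = [random_distribution(n, rng) for _ in range(n)]
    p_r_given_d = [random_distribution(n, rng) for _ in range(n)]
    joint_wd, joint_wr = {}, {}
    for w in range(n):
        for d in range(n):
            p_wd = p_w[w] * p_d_given_w[w][d]        # P(w) P(d|w)
            joint_wd[(w, d)] = p_wd
            for r in range(n):                       # sum over d of P(w,d) P(r|d)
                joint_wr[(w, r)] = joint_wr.get((w, r), 0.0) + p_wd * p_r_given_d[d][r]
    return mutual_information(joint_wd), mutual_information(joint_wr)

rng = random.Random(0)
results = [check_once(rng) for _ in range(1000)]
print(all(i_wr <= i_wd + 1e-12 for i_wd, i_wr in results))
```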


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


About Chapter 9 


Before reading Chapter 9, you should have read Chapter 1 and worked on
exercise 2.26 (p.37), and exercises 8.2-8.7 (pp. 140-141).

Communication over a Noisy Channel


> 9.1 The big picture 


[Figure: Source → Compressor → Encoder → Noisy channel → Decoder → Decompressor.
The compressor and decompressor perform source coding; the encoder and
decoder perform channel coding.]


In Chapters 4-6, we discussed source coding with block codes, symbol codes 
and stream codes. We implicitly assumed that the channel from the compres- 
sor to the decompressor was noise-free. Real channels are noisy. We will now 
spend two chapters on the subject of noisy-channel coding — the fundamen- 
tal possibilities and limitations of error-free communication through a noisy 
channel. The aim of channel coding is to make the noisy channel behave like 
a noiseless channel. We will assume that the data to be transmitted has been 
through a good compressor, so the bit stream has no obvious redundancy. The 
channel code, which makes the transmission, will put back redundancy of a 
special sort, designed to make the noisy received signal decodeable. 

Suppose we transmit 1000 bits per second with p_0 = p_1 = 1/2 over a
noisy channel that flips bits with probability f = 0.1. What is the rate of 
transmission of information? We might guess that the rate is 900 bits per 
second by subtracting the expected number of errors per second. But this is 
not correct, because the recipient does not know where the errors occurred. 
Consider the case where the noise is so great that the received symbols are 
independent of the transmitted symbols. This corresponds to a noise level of 
f = 0.5, since half of the received symbols are correct due to chance alone. 
But when f = 0.5, no information is transmitted at all. 

Given what we have learnt about entropy, it seems reasonable that a mea- 
sure of the information transmitted is given by the mutual information between 
the source and the received signal, that is, the entropy of the source minus the 
conditional entropy of the source given the received signal. 
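
To make the 1000 bits per second example concrete, here is a small sketch that evaluates this measure for equiprobable inputs to a bit-flipping channel. It anticipates the binary entropy function H_2 defined later in the chapter and uses the fact that, for equiprobable inputs, H(X) = 1 and H(X|Y) = H_2(f); the function names are illustrative assumptions.

```python
from math import log2

def binary_entropy(p):
    """H_2(p) in bits, with the convention 0 log(1/0) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def bsc_information_rate(bits_per_second, f):
    """Mutual information per second for equiprobable inputs and flip probability f."""
    return bits_per_second * (1.0 - binary_entropy(f))

print(round(bsc_information_rate(1000, 0.1)))  # 531, well below the naive guess of 900
print(bsc_information_rate(1000, 0.5))         # 0.0
```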

We will now review the definition of conditional entropy and mutual in- 
formation. Then we will examine whether it is possible to use such a noisy 
channel to communicate reliably. We will show that for any channel Q there
is a non-negative rate, the capacity C(Q), up to which information can be sent
with arbitrarily small probability of error.


»> 9.2 Review of probability and information 


As an example, we take the joint distribution XY from exercise 8.6 (p.140).
The marginal distributions P(x) and P(y) are shown in the margins.

  P(x,y)           x                      P(y)
             1      2      3      4
  y   1     1/8    1/16   1/32   1/32     1/4
      2     1/16   1/8    1/32   1/32     1/4
      3     1/16   1/16   1/16   1/16     1/4
      4     1/4    0      0      0        1/4

  P(x)      1/2    1/4    1/8    1/8


The joint entropy is H(X, Y) = 27/8 bits. The marginal entropies are H(X) = 
7/4 bits and H(Y) = 2 bits. 

We can compute the conditional distribution of x for each value of y, and 
the entropy of each of those conditional distributions: 


  P(x|y)           x                      H(X|y)/bits
             1      2      3      4
  y   1     1/2    1/4    1/8    1/8      7/4
      2     1/4    1/2    1/8    1/8      7/4
      3     1/4    1/4    1/4    1/4      2
      4     1      0      0      0        0

                                 H(X|Y) = 11/8





Note that whereas H(X |y= 4) = 0 is less than H(X), H(X | y=3) is greater 
than H(X). So in some cases, learning y can increase our uncertainty about 
x. Note also that although P(x |y=2) is a different distribution from P(x), 
the conditional entropy H(X |y =2) is equal to H(X). So learning that y 
is 2 changes our knowledge about x but does not reduce the uncertainty of 
x, as measured by the entropy. On average though, learning y does convey 
information about x, since H(X |Y) < H(X). 

One may also evaluate H(Y|X) = 13/8 bits. The mutual information is
I(X;Y) = H(X) − H(X|Y) = 3/8 bits.
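
The entropies quoted in this review can be reproduced directly from the joint distribution. The sketch below assumes the joint table of exercise 8.6 (rows indexed by y, columns by x); the variable names are illustrative.

```python
from fractions import Fraction as F
from math import log2

# P(x, y) from exercise 8.6: rows are y = 1..4, columns are x = 1..4.
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(float(p) * log2(1 / float(p)) for p in probs if p > 0)

p_y = [sum(row) for row in P]          # marginal P(y): sum along each row
p_x = [sum(col) for col in zip(*P)]    # marginal P(x): sum down each column

H_XY = entropy(p for row in P for p in row)
H_X, H_Y = entropy(p_x), entropy(p_y)
# H(X | y) for each y, from the normalized row P(x | y) = P(x, y) / P(y).
H_X_given_y = [entropy(p / py for p in row) for row, py in zip(P, p_y)]
H_X_given_Y = sum(float(py) * h for py, h in zip(p_y, H_X_given_y))

print(H_XY, H_X, H_Y)                              # 3.375 1.75 2.0
print(H_X_given_y)                                 # [1.75, 1.75, 2.0, 0.0]
print(H_X_given_Y, H_X - H_X_given_Y, H_XY - H_X)  # 1.375 0.375 1.625
```

The last line gives H(X|Y), I(X;Y) and H(Y|X), the latter via the chain rule H(X,Y) = H(X) + H(Y|X).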


»> 9.3 Noisy channels 


A discrete memoryless channel Q is characterized by an input alphabet
A_X, an output alphabet A_Y, and a set of conditional probability distributions
P(y|x), one for each x ∈ A_X.


These transition probabilities may be written in a matrix

\[ Q_{j|i} = P(y = b_j \mid x = a_i). \tag{9.1} \]

I usually orient this matrix with the output variable j indexing the rows
and the input variable i indexing the columns, so that each column of Q is
a probability vector. With this convention, we can obtain the probability
of the output, p_Y, from a probability distribution over the input, p_X, by
right-multiplication:

\[ \mathbf{p}_Y = \mathbf{Q}\,\mathbf{p}_X. \tag{9.2} \]


Some useful model channels are: 


Binary symmetric channel. A_X = {0,1}. A_Y = {0,1}.

    P(y=0 | x=0) = 1 − f;    P(y=0 | x=1) = f;
    P(y=1 | x=0) = f;        P(y=1 | x=1) = 1 − f.

Binary erasure channel. A_X = {0,1}. A_Y = {0,?,1}.

    P(y=0 | x=0) = 1 − f;    P(y=0 | x=1) = 0;
    P(y=? | x=0) = f;        P(y=? | x=1) = f;
    P(y=1 | x=0) = 0;        P(y=1 | x=1) = 1 − f.

Noisy typewriter. A_X = A_Y = the 27 letters {A, B, ..., Z, -}. The letters
are arranged in a circle, and when the typist attempts to type B, what
comes out is either A, B or C, with probability 1/3 each; when the input is
C, the output is B, C or D; and so forth, with the final letter '-' adjacent
to the first letter A.

    ...
    P(y=F | x=G) = 1/3;
    P(y=G | x=G) = 1/3;
    P(y=H | x=G) = 1/3;
    ...

Z channel. A_X = {0,1}. A_Y = {0,1}.

    P(y=0 | x=0) = 1;        P(y=0 | x=1) = f;
    P(y=1 | x=0) = 0;        P(y=1 | x=1) = 1 − f.
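
The model channels above can be written down as transition matrices in the orientation of equation (9.1), with outputs indexing rows and inputs indexing columns. This is an illustrative sketch using plain nested lists; the function names are assumptions, not notation from the text.

```python
def bsc(f):
    """Binary symmetric channel: rows are outputs (0, 1), columns are inputs (0, 1)."""
    return [[1 - f, f],
            [f, 1 - f]]

def bec(f):
    """Binary erasure channel: rows are outputs (0, ?, 1), columns are inputs (0, 1)."""
    return [[1 - f, 0.0],
            [f, f],
            [0.0, 1 - f]]

def z_channel(f):
    """Z channel: input 0 is always received as 0; input 1 becomes 0 with probability f."""
    return [[1.0, f],
            [0.0, 1 - f]]

def noisy_typewriter(n=27):
    """Circular typewriter: each input goes to itself or either neighbour with probability 1/3."""
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            Q[j % n][i] = 1 / 3
    return Q

def output_distribution(Q, p_x):
    """p_Y = Q p_X, with Q[j][i] = P(y = b_j | x = a_i)."""
    return [sum(q_ji * p_i for q_ji, p_i in zip(row, p_x)) for row in Q]

def columns_are_normalized(Q, tol=1e-12):
    """Each column of Q should be a probability vector."""
    return all(abs(sum(col) - 1.0) < tol for col in zip(*Q))

print(output_distribution(bsc(0.15), [0.9, 0.1]))  # approximately [0.78, 0.22]
```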











> 9.4 Inferring the input given the output 


If we assume that the input x to a channel comes from an ensemble X, then 
we obtain a joint ensemble XY in which the random variables x and y have 
the joint distribution: 

\[ P(x,y) = P(y \mid x)\,P(x). \tag{9.3} \]

Now if we receive a particular symbol y, what was the input symbol x? We
typically won't know for certain. We can write down the posterior distribution
of the input using Bayes' theorem:

\[ P(x \mid y) = \frac{P(y \mid x) P(x)}{P(y)} = \frac{P(y \mid x) P(x)}{\sum_{x'} P(y \mid x') P(x')}. \tag{9.4} \]

Example 9.1. Consider a binary symmetric channel with probability of error
f = 0.15. Let the input ensemble be P_X : {p_0 = 0.9, p_1 = 0.1}. Assume
we observe y = 1.

\begin{align}
P(x=1 \mid y=1) &= \frac{P(y=1 \mid x=1) P(x=1)}{\sum_{x'} P(y \mid x') P(x')} \notag \\
&= \frac{0.85 \times 0.1}{0.85 \times 0.1 + 0.15 \times 0.9} \notag \\
&= \frac{0.085}{0.22} = 0.39. \tag{9.5}
\end{align}

Thus 'x = 1' is still less probable than 'x = 0', although it is not as
improbable as it was before.


Exercise 9.2.[1, p.157] Now assume we observe y = 0. Compute the probability
of x = 1 given y = 0.


Example 9.3. Consider a Z channel with probability of error f = 0.15. Let the
input ensemble be P_X : {p_0 = 0.9, p_1 = 0.1}. Assume we observe y = 1.

\begin{align}
P(x=1 \mid y=1) &= \frac{0.85 \times 0.1}{0.85 \times 0.1 + 0 \times 0.9} \notag \\
&= \frac{0.085}{0.085} = 1.0. \tag{9.6}
\end{align}

So given the output y=1 we become certain of the input. 
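
The posterior computations in examples 9.1 and 9.3 follow the same recipe, which can be sketched in a few lines. The function `posterior` and the list-of-rows layout `Q[y][x]` are illustrative assumptions.

```python
def posterior(Q, p_x, y):
    """P(x | y) for every input x, given Q[y][x] = P(y | x) and the prior p_x."""
    joint = [Q[y][x] * p_x[x] for x in range(len(p_x))]  # P(y | x) P(x)
    evidence = sum(joint)                                # P(y)
    return [j / evidence for j in joint]

f = 0.15
prior = [0.9, 0.1]
bsc = [[1 - f, f], [f, 1 - f]]
z_channel = [[1.0, f], [0.0, 1 - f]]

print(posterior(bsc, prior, y=1))        # P(x=1 | y=1) is about 0.39
print(posterior(z_channel, prior, y=1))  # [0.0, 1.0]
```

Calling the same function with `y=0` gives the quantities asked for in the neighbouring exercises.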


Exercise 9.4.[p.157] Alternatively, assume we observe y = 0. Compute
P(x=1 | y=0).


> 9.5 Information conveyed by a channel 


We now consider how much information can be communicated through a chan- 
nel. In operational terms, we are interested in finding ways of using the chan- 
nel such that all the bits that are communicated are recovered with negligible 
probability of error. In mathematical terms, assuming a particular input en- 
semble X, we can measure how much information the output conveys about 
the input by the mutual information: 


I(X;Y) = H(X) — H(X |Y) = H(Y) — H(Y|X). (9.7) 


Our aim is to establish the connection between these two ideas. Let us evaluate 
I(X;Y) for some of the channels above. 


Hint for computing mutual information 


We will tend to think of I(X;Y) as H(X) — H(X |Y), i.e., how much the 
uncertainty of the input X is reduced when we look at the output Y. But for 
computational purposes it is often handy to evaluate H(Y)— H(Y |X) instead. 


Figure 9.1. The relationship between joint information, marginal entropy,
conditional entropy and mutual entropy. This figure is important, so I'm
showing it twice. [The diagram stacks H(X,Y) above H(X) and H(Y), with the
bottom row split into H(X|Y), I(X;Y) and H(Y|X).]


Example 9.5. Consider the binary symmetric channel again, with f = 0.15 and
P_X : {p_0 = 0.9, p_1 = 0.1}. We already evaluated the marginal probabilities
P(y) implicitly above: P(y=0) = 0.78; P(y=1) = 0.22. The
mutual information is:

    I(X;Y) = H(Y) − H(Y|X).

What is H(Y|X)? It is defined to be the weighted sum over x of H(Y|x);
but H(Y|x) is the same for each value of x: H(Y|x=0) is H_2(0.15),
and H(Y|x=1) is H_2(0.15). So

\begin{align}
I(X;Y) &= H(Y) - H(Y \mid X) \notag \\
&= H_2(0.22) - H_2(0.15) \notag \\
&= 0.76 - 0.61 = 0.15 \text{ bits}. \tag{9.8}
\end{align}

This may be contrasted with the entropy of the source H(X) =
H_2(0.1) = 0.47 bits.

Note: here we have used the binary entropy function
H_2(p) ≡ H(p, 1−p) = p log(1/p) + (1−p) log(1/(1−p)). Throughout this book,
log means log_2.
Example 9.6. And now the Z channel, with P_X as above. P(y=1) = 0.085.

\begin{align}
I(X;Y) &= H(Y) - H(Y \mid X) \notag \\
&= H_2(0.085) - [0.9 H_2(0) + 0.1 H_2(0.15)] \notag \\
&= 0.42 - (0.1 \times 0.61) = 0.36 \text{ bits}. \tag{9.9}
\end{align}


The entropy of the source, as above, is H(X) = 0.47 bits. Notice that 
the mutual information I(X;Y) for the Z channel is bigger than the 
mutual information for the binary symmetric channel with the same f. 
The Z channel is a more reliable channel. 
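
Both examples evaluate I(X;Y) = H(Y) − H(Y|X) from a transition matrix and an input distribution. The following sketch does the same computation for any discrete memoryless channel; the names `entropy` and `mutual_information` are illustrative.

```python
from math import log2

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

def mutual_information(Q, p_x):
    """I(X;Y) = H(Y) - H(Y|X) in bits, with Q[j][i] = P(y = b_j | x = a_i)."""
    p_y = [sum(q * p for q, p in zip(row, p_x)) for row in Q]
    # H(Y|X) is the average over inputs of the entropy of each column of Q.
    h_y_given_x = sum(p * entropy(col) for p, col in zip(p_x, zip(*Q)))
    return entropy(p_y) - h_y_given_x

f = 0.15
p_x = [0.9, 0.1]
bsc = [[1 - f, f], [f, 1 - f]]
z_channel = [[1.0, f], [0.0, 1 - f]]

print(round(mutual_information(bsc, p_x), 2))        # 0.15
print(round(mutual_information(z_channel, p_x), 2))  # 0.36
```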

Exercise 9.7.[p.157] Compute the mutual information between X and Y for
the binary symmetric channel with f = 0.15 when the input distribution
is P_X = {p_0 = 0.5, p_1 = 0.5}.

Exercise 9.8.[p.157] Compute the mutual information between X and Y for
the Z channel with f = 0.15 when the input distribution is
P_X : {p_0 = 0.5, p_1 = 0.5}.


Maximizing the mutual information 


We have observed in the above examples that the mutual information between 
the input and the output depends on the chosen input ensemble. 

Let us assume that we wish to maximize the mutual information conveyed 
by the channel by choosing the best possible input ensemble. We define the 
capacity of the channel to be its maximum mutual information. 


The capacity of a channel Q is:

\[ C(Q) = \max_{P_X} I(X;Y). \tag{9.10} \]

The distribution P_X that achieves the maximum is called the optimal
input distribution, denoted by P_X^*. [There may be multiple optimal
input distributions achieving the same value of I(X;Y).]


In Chapter 10 we will show that the capacity does indeed measure the maxi- 
mum amount of error-free information that can be transmitted over the chan- 
nel per unit time. 


Example 9.9. Consider the binary symmetric channel with f = 0.15. Above,
we considered P_X = {p_0 = 0.9, p_1 = 0.1}, and found I(X;Y) = 0.15 bits.
How much better can we do? By symmetry, the optimal input distribution
is {0.5, 0.5} and the capacity is

\[ C(Q_{\rm BSC}) = H_2(0.5) - H_2(0.15) = 1.0 - 0.61 = 0.39 \text{ bits}. \tag{9.11} \]

We'll justify the symmetry argument later. If there's any doubt about
the symmetry argument, we can always resort to explicit maximization
of the mutual information I(X;Y),

\[ I(X;Y) = H_2((1-f) p_1 + (1-p_1) f) - H_2(f) \quad \text{(figure 9.2)}. \tag{9.12} \]

Example 9.10. The noisy typewriter. The optimal input distribution is a uniform
distribution over x, and gives C = log_2 9 bits.


Example 9.11. Consider the Z channel with f = 0.15. Identifying the optimal
input distribution is not so straightforward. We evaluate I(X;Y) explicitly
for P_X = {p_0, p_1}. First, we need to compute P(y). The probability
of y = 1 is easiest to write down:

\[ P(y=1) = p_1 (1-f). \tag{9.13} \]

Then the mutual information is:

\begin{align}
I(X;Y) &= H(Y) - H(Y \mid X) \notag \\
&= H_2(p_1(1-f)) - (p_0 H_2(0) + p_1 H_2(f)) \notag \\
&= H_2(p_1(1-f)) - p_1 H_2(f). \tag{9.14}
\end{align}

This is a non-trivial function of p_1, shown in figure 9.3. It is maximized
for f = 0.15 by p_1^* = 0.445. We find C(Q_Z) = 0.685. Notice the optimal
input distribution is not {0.5, 0.5}. We can communicate slightly more
information by using input symbol 0 more frequently than 1.
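
The explicit maximization mentioned in examples 9.9 and 9.11 can be carried out with a simple grid search over p_1. This is a sketch, not the method used to produce the book's figures; the grid resolution and function names are arbitrary assumptions.

```python
from math import log2

def binary_entropy(p):
    """H_2(p) in bits, with H_2(0) = H_2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

def mi_bsc(p1, f):
    """Equation (9.12): mutual information of the binary symmetric channel."""
    return binary_entropy((1 - f) * p1 + (1 - p1) * f) - binary_entropy(f)

def mi_z(p1, f):
    """Equation (9.14): mutual information of the Z channel."""
    return binary_entropy(p1 * (1 - f)) - p1 * binary_entropy(f)

def maximize(mi, f, steps=10000):
    """Return (best p1, capacity estimate) from a uniform grid over p1 in [0, 1]."""
    grid = (k / steps for k in range(steps + 1))
    return max(((p1, mi(p1, f)) for p1 in grid), key=lambda pair: pair[1])

p1_bsc, c_bsc = maximize(mi_bsc, 0.15)
p1_z, c_z = maximize(mi_z, 0.15)
print(round(p1_bsc, 3), round(c_bsc, 2))  # 0.5 0.39
print(round(p1_z, 3), round(c_z, 3))      # 0.445 0.685
```

A grid search is crude but adequate here because the mutual information is a concave function of the input distribution, so there are no spurious local maxima to worry about.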


Exercise 9.12.[p.158] What is the capacity of the binary symmetric channel
for general f?


Exercise 9.13.[p.158] Show that the capacity of the binary erasure channel
with f = 0.15 is C_BEC = 0.85. What is its capacity for general f?
Comment.


> 9.6 The noisy-channel coding theorem 


It seems plausible that the ‘capacity’ we have defined may be a measure of 
information conveyed by a channel; what is not obvious, and what we will 
prove in the next chapter, is that the capacity indeed measures the rate at 
which blocks of data can be communicated over the channel with arbitrarily 
small probability of error. 

We make the following definitions. 


An (N, K) block code for a channel Q is a list of S = 2^K codewords

\[ \{ \mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(2^K)} \}, \quad \mathbf{x}^{(s)} \in A_X^N, \]

each of length N. Using this code we can encode a signal s ∈
{1, 2, 3, ..., 2^K} as x^(s). [The number of codewords S is an integer,
but the number of bits specified by choosing a codeword, K ≡ log_2 S, is
not necessarily an integer.]

Figure 9.2. The mutual information I(X;Y) for a binary symmetric channel
with f = 0.15 as a function of the input distribution.

Figure 9.3. The mutual information I(X;Y) for a Z channel with f = 0.15
as a function of the input distribution.



The rate of the code is R = K/N bits per channel use. 


[We will use this definition of the rate for any channel, not only chan- 
nels with binary inputs; note however that it is sometimes conventional 
to define the rate of a code for a channel with q input symbols to be 


K/(N log q).] 
A decoder for an (N, K) block code is a mapping from the set of length-N
strings of channel outputs, A_Y^N, to a codeword label ŝ ∈ {0, 1, 2, ..., 2^K}.
The extra symbol ŝ = 0 can be used to indicate a 'failure'.


The probability of block error of a code and decoder, for a given channel,
and for a given probability distribution over the encoded signal P(s_in),
is:

\[ p_{\rm B} = \sum_{s_{\rm in}} P(s_{\rm in}) \, P(s_{\rm out} \neq s_{\rm in} \mid s_{\rm in}). \tag{9.15} \]

The maximal probability of block error is

\[ p_{\rm BM} = \max_{s_{\rm in}} P(s_{\rm out} \neq s_{\rm in} \mid s_{\rm in}). \tag{9.16} \]

The optimal decoder for a channel code is the one that minimizes the probability
of block error. It decodes an output y as the input s that has
maximum posterior probability P(s|y).

\[ P(s \mid \mathbf{y}) = \frac{P(\mathbf{y} \mid s) P(s)}{\sum_{s'} P(\mathbf{y} \mid s') P(s')} \tag{9.17} \]

\[ \hat{s}_{\rm optimal} = \operatorname{argmax} P(s \mid \mathbf{y}). \tag{9.18} \]


A uniform prior distribution on s is usually assumed, in which case the
optimal decoder is also the maximum likelihood decoder, i.e., the decoder
that maps an output y to the input s that has maximum likelihood P(y|s).
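
To illustrate these definitions, the sketch below applies the maximum-posterior rule of equations (9.17) and (9.18) to a small hypothetical code, the (N = 3, K = 1) repetition code on a binary symmetric channel, and evaluates p_B and p_BM by enumerating every received string. The choice of code, the channel parameter and all function names are assumptions made for illustration only.

```python
from itertools import product

def likelihood(y, x, f):
    """P(y | x) for len(x) independent uses of a BSC with flip probability f."""
    flips = sum(yi != xi for yi, xi in zip(y, x))
    return f ** flips * (1 - f) ** (len(x) - flips)

def map_decode(y, codewords, prior, f):
    """Return the label s in 1..S maximizing P(s | y), proportional to P(y | s) P(s)."""
    scores = [likelihood(y, x, f) * p for x, p in zip(codewords, prior)]
    return 1 + max(range(len(codewords)), key=scores.__getitem__)

def block_error_probabilities(codewords, prior, f):
    """Return (p_B, p_BM) by summing over every possible received string."""
    n = len(codewords[0])
    per_codeword = []
    for s, x in enumerate(codewords, start=1):
        p_err = sum(likelihood(y, x, f)
                    for y in product((0, 1), repeat=n)
                    if map_decode(y, codewords, prior, f) != s)
        per_codeword.append(p_err)  # P(s_out != s_in | s_in = s)
    p_b = sum(p * e for p, e in zip(prior, per_codeword))
    return p_b, max(per_codeword)

# Hypothetical example: repetition code with two codewords, uniform prior, f = 0.15.
codewords = [(0, 0, 0), (1, 1, 1)]
p_b, p_bm = block_error_probabilities(codewords, [0.5, 0.5], f=0.15)
print(p_b, p_bm)  # both close to 3 f^2 (1 - f) + f^3 = 0.06075
```

With a uniform prior the decoder reduces to majority voting, which is the maximum likelihood decoder for this code; the failure symbol ŝ = 0 is never used in this sketch.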
The probability of bit error p_b is defined assuming that the codeword
number s is represented by a binary vector s of length K bits; it is the
average probability that a bit of s_out is not equal to the corresponding
bit of s_in (averaging over all K bits).

Shannon's noisy-channel coding theorem (part one). Associated with
each discrete memoryless channel, there is a non-negative number C
(called the channel capacity) with the following property. For any ε > 0
and R < C, for large enough N, there exists a block code of length N and
rate ≥ R and a decoding algorithm, such that the maximal probability
of block error is < ε.

Figure 9.4. Portion of the R, p_BM plane asserted to be achievable by
the first part of Shannon's noisy channel coding theorem.


Confirmation of the theorem for the noisy typewriter channel 


In the case of the noisy typewriter, we can easily confirm the theorem, because 
we can create a completely error-free communication strategy using a block 
code of length N = 1: we use only the letters B, E, H, ..., Z, i.e., every third 
letter. These letters form a non-confusable subset of the input alphabet (see 
figure 9.5). Any output can be uniquely decoded. The number of inputs 
in the non-confusable subset is 9, so the error-free information rate of this 
system is log_2 9 bits, which is equal to the capacity C, which we evaluated in
example 9.10 (p.151).

Figure 9.5. A non-confusable subset of inputs for the noisy typewriter.

Figure 9.6. Extended channels obtained from a binary symmetric channel
with transition probability 0.15.

How does this translate into the terms of the theorem? The following table
explains.
The theorem                               How it applies to the noisy typewriter

Associated with each discrete             The capacity C is log_2 9.
memoryless channel, there is a
non-negative number C.

For any ε > 0 and R < C, for large        No matter what ε and R are, we set the
enough N,                                 blocklength N to 1.

there exists a block code of length N     The block code is {B, E, ..., Z}. The value
and rate ≥ R                              of K is given by 2^K = 9, so K = log_2 9, and
                                          this code has rate log_2 9, which is greater
                                          than the requested value of R.

and a decoding algorithm,                 The decoding algorithm maps the received
                                          letter to the nearest letter in the code;

such that the maximal probability of      the maximal probability of block error is
block error is < ε.                       zero, which is less than the given ε.
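
The error-free strategy for the noisy typewriter is small enough to spell out in code. This sketch builds the every-third-letter code and its decoding table; the names `ALPHABET`, `possible_outputs` and `decode` are illustrative.

```python
from math import log2
import string

ALPHABET = string.ascii_uppercase + "-"   # 27 symbols arranged in a circle

def possible_outputs(letter):
    """The three equiprobable outputs when `letter` is typed."""
    i = ALPHABET.index(letter)
    return {ALPHABET[(i + d) % len(ALPHABET)] for d in (-1, 0, 1)}

code = ALPHABET[1::3]                      # B, E, H, ..., Z: every third letter
# Each received letter can have come from exactly one codeword.
decode_table = {y: x for x in code for y in possible_outputs(x)}

def decode(received):
    """Map a received letter to the unique codeword that could have produced it."""
    return decode_table[received]

print(code, len(code), log2(len(code)))    # BEHKNQTWZ 9 and about 3.17 bits
```

Because the nine output sets are disjoint and together cover all 27 letters, the decoder never makes an error.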


> 9.7 Intuitive preview of proof 


Extended channels 


To prove the theorem for any given channel, we consider the extended channel
corresponding to N uses of the channel. The extended channel has |A_X|^N
possible inputs x and |A_Y|^N possible outputs. Extended channels obtained
from a binary symmetric channel and from a Z channel are shown in figures
9.6 and 9.7, with N = 2 and N = 4.

Figure 9.7. Extended channels obtained from a Z channel with transition
probability 0.15. Each column corresponds to an input, and each row is a
different output. (Panels: N = 1, N = 2, N = 4.)

Figure 9.8. (a) Some typical outputs in A_Y^N corresponding to typical
inputs x. (b) A subset of the typical sets shown in (a) that do not overlap
each other. This picture can be compared with the solution to the noisy
typewriter in figure 9.5.
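
Because the channel is memoryless, the transition matrix of an extended channel is the N-fold Kronecker product of the single-use matrix with itself. The sketch below builds it for the Z channel; the helper names are assumptions, and the ordering of the extended inputs and outputs (first use most significant) is a convention of this sketch that need not match the ordering used in the figures.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def extended_channel(Q, n):
    """Transition matrix for n independent uses of the memoryless channel Q."""
    result = [[1.0]]
    for _ in range(n):
        result = kron(result, Q)
    return result

f = 0.15
z_channel = [[1.0, f], [0.0, 1 - f]]   # rows are outputs, columns are inputs
Q2 = extended_channel(z_channel, 2)    # 4 x 4; rows and columns ordered 00, 01, 10, 11
Q4 = extended_channel(z_channel, 4)    # 16 x 16

print(Q2[0])  # P(y = 00 | x) for x = 00, 01, 10, 11, approximately [1, f, f, f^2]
```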





Exercise 9.14.[p.159] Find the transition probability matrices Q for the extended
channel, with N = 2, derived from the binary erasure channel
having erasure probability 0.15.


By selecting two columns of this transition probability matrix, we can
define a rate-1/2 code for this channel with blocklength N = 2. What is
the best choice of two columns? What is the decoding algorithm?


To prove the noisy-channel coding theorem, we make use of large block- 
lengths N. The intuitive idea is that, if N is large, an extended channel looks 
a lot like the noisy typewriter. Any particular input x is very likely to produce 
an output in a small subspace of the output alphabet — the typical output set, 
given that input. So we can find a non-confusable subset of the inputs that 
produce essentially disjoint output sequences. For a given N, let us consider 
a way of generating such a non-confusable subset of the inputs, and count up 
how many distinct inputs it contains. 

Imagine making an input sequence x for the extended channel by drawing
it from an ensemble X^N, where X is an arbitrary ensemble over the input
alphabet. Recall the source coding theorem of Chapter 4, and consider the
number of probable output sequences y. The total number of typical output
sequences y is 2^{NH(Y)}, all having similar probability. For any particular typical
input sequence x, there are about 2^{NH(Y|X)} probable sequences. Some of these
subsets of A_Y^N are depicted by circles in figure 9.8a.

We now imagine restricting ourselves to a subset of the typical inputs
x such that the corresponding typical output sets do not overlap, as shown
in figure 9.8b. We can then bound the number of non-confusable inputs by
dividing the size of the typical y set, 2^{NH(Y)}, by the size of each
typical-y-given-typical-x set, 2^{NH(Y|X)}. So the number of non-confusable
inputs, if they are selected from the set of typical inputs x ∼ X^N, is
≤ 2^{NH(Y) − NH(Y|X)} = 2^{NI(X;Y)}.

The maximum value of this bound is achieved if X is the ensemble that
maximizes I(X;Y), in which case the number of non-confusable inputs is
≤ 2^{NC}. Thus asymptotically up to C bits per cycle, and no more, can be
communicated with vanishing error probability. □

This sketch has not rigorously proved that reliable communication really 
is possible — that’s our task for the next chapter. 





> 9.8 Further exercises 


Exercise 9.15.[p.159] Refer back to the computation of the capacity of the Z
channel with f = 0.15.

(a) Why is p_1^* less than 0.5? One could argue that it is good to favour
the 0 input, since it is transmitted without error, and also argue
that it is good to favour the 1 input, since it often gives rise to the
highly prized 1 output, which allows certain identification of the
input! Try to make a convincing argument.

(b) In the case of general f, show that the optimal input distribution is

\[ p_1^* = \frac{1/(1-f)}{1 + 2^{H_2(f)/(1-f)}}. \tag{9.19} \]

(c) What happens to p_1^* if the noise level f is very close to 1?


Exercise 9.16.[p.159] Sketch graphs of the capacity of the Z channel, the
binary symmetric channel and the binary erasure channel as a function
of f.

Exercise 9.17. What is the capacity of the five-input, ten-output channel
whose transition probability matrix is

\[
\begin{bmatrix}
0.25 & 0 & 0 & 0 & 0.25 \\
0.25 & 0 & 0 & 0 & 0.25 \\
0.25 & 0.25 & 0 & 0 & 0 \\
0.25 & 0.25 & 0 & 0 & 0 \\
0 & 0.25 & 0.25 & 0 & 0 \\
0 & 0.25 & 0.25 & 0 & 0 \\
0 & 0 & 0.25 & 0.25 & 0 \\
0 & 0 & 0.25 & 0.25 & 0 \\
0 & 0 & 0 & 0.25 & 0.25 \\
0 & 0 & 0 & 0.25 & 0.25
\end{bmatrix}
? \tag{9.20}
\]


Exercise 9.18.[p.159] Consider a Gaussian channel with binary input x ∈
{−1, +1} and real output alphabet A_Y, with transition probability density

\[ Q(y \mid x, \alpha, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y - x\alpha)^2}{2\sigma^2}}, \tag{9.21} \]

where α is the signal amplitude.

(a) Compute the posterior probability of x given y, assuming that the
two inputs are equiprobable. Put your answer in the form

\[ P(x=1 \mid y, \alpha, \sigma) = \frac{1}{1 + e^{-a(y)}}. \tag{9.22} \]

Sketch the value of P(x=1 | y, α, σ) as a function of y.

(b) Assume that a single bit is to be transmitted. What is the optimal
decoder, and what is its probability of error? Express your answer
in terms of the signal-to-noise ratio α²/σ² and the error function
(the cumulative probability function of the Gaussian distribution),

\[ \Phi(z) \equiv \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{z^2}{2}} \, dz. \tag{9.23} \]

[Note that this definition of the error function Φ(z) may not correspond
to other people's.]


Pattern recognition as a noisy channel 


We may think of many pattern recognition problems in terms of communi- 
cation channels. Consider the case of recognizing handwritten digits (such 
as postcodes on envelopes). The author of the digit wishes to communicate 
a message from the set Ax = {0,1,2,3,...,9}; this selected message is the 
input to the channel. What comes out of the channel is a pattern of ink on 
paper. If the ink pattern is represented using 256 binary pixels, the channel 
Q has as its output a random variable y ∈ A_Y = {0,1}^256. An example of an
element from this alphabet is shown in the margin. 

















Exercise 9.19. Estimate how many patterns in A_Y are recognizable as the
character '2'. [The aim of this problem is to try to demonstrate the
existence of as many patterns as possible that are recognizable as 2s.]

Discuss how one might model the channel P(y | x=2). Estimate the
entropy of the probability distribution P(y | x=2).

One strategy for doing pattern recognition is to create a model for
P(y|x) for each value of the input x ∈ {0,1,2,3,...,9}, then use Bayes'
theorem to infer x given y.

\[ P(x \mid y) = \frac{P(y \mid x) P(x)}{\sum_{x'} P(y \mid x') P(x')}. \tag{9.24} \]

This strategy is known as full probabilistic modelling or generative
modelling. This is essentially how current speech recognition systems
work. In addition to the channel model, P(y|x), one uses a prior probability
distribution P(x), which in the case of both character recognition
and speech recognition is a language model that specifies the probability
of the next character/word given the context and the known grammar
and statistics of the language.


Random coding Eh 

“= Exercise 9.20.7 P-160] Given twenty-four people in a room, what is the prob- r 
= ability that there are at least two people present who have the same , Pa 
a 


birthday (i.e., day and month of birth)? What is the expected number 
of pairs of people with the same birthday? Which of these two questions 
is easiest to solve? Which answer gives most insight? You may find it 
helpful to solve these problems and those that follow using notation such 
as A = number of days in year = 365 and S = number of people = 24. 


Figure 9.9. Some more 2s. 


> Exercise 9.21.!#] The birthday problem may be related to a coding scheme. 


Assume we wish to convey a message to an outsider identifying one of 
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the twenty-four people. We could simply communicate a number s from 
Ag = {1,2,...,24}, having agreed a mapping of people onto numbers; 
alternatively, we could convey a number from Ax = {1,2,...,365}, 
identifying the day of the year that is the selected person’s birthday 
(with apologies to leapyearians). [The receiver is assumed to know all 
the people’s birthdays.] What, roughly, is the probability of error of this 
communication scheme, assuming it is used for a single transmission? 
What is the capacity of the communication channel, and what is the 
rate of communication attempted by this scheme? 
> Exercise 9.22.1?! 
Now imagine that there are K rooms in a building, each
containing q people. (You might think of K = 2 and q = 24 as an
example.) The aim is to communicate a selection of one person from each
room by transmitting an ordered list of K days (from $\mathcal{A}_X$). Compare
the probability of error of the following two schemes.


(a) As before, where each room transmits the birthday of the selected 
person. 


(b) To each K-tuple of people, one drawn from each room, an ordered
K-tuple of randomly selected days from $\mathcal{A}_X$ is assigned (this
K-tuple has nothing to do with their birthdays). This enormous list
of $S = q^K$ strings is known to the receiver. When the building has
selected a particular person from each room, the ordered string of
days corresponding to that K-tuple of people is transmitted.


What is the probability of error when q = 364 and K = 1? What is the 
probability of error when q = 364 and K is large, e.g. K = 6000? 


> 9.9 Solutions 
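Solution to exercise 9.2 (p.149). If we assume we observe y=0,

$$\begin{aligned}
P(x=1 \mid y=0) &= \frac{P(y=0 \mid x=1)\,P(x=1)}{\sum_{x'} P(y=0 \mid x')\,P(x')} && (9.25) \\
&= \frac{0.15 \times 0.1}{0.15 \times 0.1 + 0.85 \times 0.9} && (9.26) \\
&= \frac{0.015}{0.78} = 0.019. && (9.27)
\end{aligned}$$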


Solution to exercise 9.4 (p.149). If we observe y = 0,

$$\begin{aligned}
P(x=1 \mid y=0) &= \frac{0.15 \times 0.1}{0.15 \times 0.1 + 1.0 \times 0.9} && (9.28) \\
&= \frac{0.015}{0.915} = 0.016. && (9.29)
\end{aligned}$$
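The two posteriors above can be checked numerically. The following is a minimal sketch; the function name `posterior_x1` and its argument names are illustrative choices, not notation from the text.

```python
def posterior_x1(p_y_given_x1, p_y_given_x0, prior_x1):
    """Posterior P(x=1 | y) for a binary input, by Bayes' theorem."""
    joint_x1 = p_y_given_x1 * prior_x1
    joint_x0 = p_y_given_x0 * (1.0 - prior_x1)
    return joint_x1 / (joint_x1 + joint_x0)


f = 0.15          # noise level used in both exercises
prior_x1 = 0.1    # P(x=1)

# Binary symmetric channel, y=0 observed: P(y=0|x=1) = f, P(y=0|x=0) = 1 - f.
bsc_posterior = posterior_x1(f, 1.0 - f, prior_x1)

# Z channel, y=0 observed: P(y=0|x=1) = f, P(y=0|x=0) = 1.
z_posterior = posterior_x1(f, 1.0, prior_x1)

print(round(bsc_posterior, 3), round(z_posterior, 3))  # prints 0.019 0.016
```

In both cases observing y = 0 lowers the probability of x = 1 well below its prior of 0.1.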


Solution to exercise 9.7 (p.150). The probability that y = 1 is 0.5, so the 
mutual information is: 


$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) && (9.30) \\
&= H_2(0.5) - H_2(0.15) && (9.31) \\
&= 1 - 0.61 = 0.39 \text{ bits}. && (9.32)
\end{aligned}$$


Solution to exercise 9.8 (p.150). We again compute the mutual information
using I(X;Y) = H(Y) - H(Y|X). The probability that y = 0 is 0.575, and
$H(Y \mid X) = \sum_x P(x) H(Y \mid x) = P(x=1) H(Y \mid x=1) + P(x=0) H(Y \mid x=0)$,
so the mutual information is:

$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) && (9.33) \\
&= H_2(0.575) - [0.5 \times H_2(0.15) + 0.5 \times 0] && (9.34) \\
&= 0.98 - 0.30 = 0.679 \text{ bits}. && (9.35)
\end{aligned}$$
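As a cross-check on the two mutual-information calculations above, here is a small sketch that evaluates I(X;Y) directly from an input distribution and a transition matrix. The function name and the list-of-rows layout `Q[x][y] = P(y|x)` are assumptions made for this illustration.

```python
from math import log2


def mutual_information(p_x, Q):
    """I(X;Y) in bits for input distribution p_x and channel Q[x][y] = P(y|x)."""
    n_out = len(Q[0])
    p_y = [sum(p_x[x] * Q[x][y] for x in range(len(p_x))) for y in range(n_out)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_out):
            joint = px * Q[x][y]
            if joint > 0.0:  # terms with zero probability contribute nothing
                total += joint * log2(Q[x][y] / p_y[y])
    return total


f = 0.15
bsc = [[1 - f, f], [f, 1 - f]]        # binary symmetric channel
z_channel = [[1.0, 0.0], [f, 1 - f]]  # Z channel: input 0 is never corrupted

i_bsc = mutual_information([0.5, 0.5], bsc)
i_z = mutual_information([0.5, 0.5], z_channel)
print(round(i_bsc, 2), round(i_z, 3))  # prints 0.39 0.679
```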


Solution to exercise 9.12 (p.151). By symmetry, the optimal input distribution
is {0.5,0.5}. Then the capacity is

$$\begin{aligned}
C = I(X;Y) &= H(Y) - H(Y \mid X) && (9.36) \\
&= H_2(0.5) - H_2(f) && (9.37) \\
&= 1 - H_2(f). && (9.38)
\end{aligned}$$

Would you like to find the optimal input distribution without invoking
symmetry? We can do this by computing the mutual information in the general
case where the input ensemble is $\{p_0, p_1\}$:

$$\begin{aligned}
I(X;Y) &= H(Y) - H(Y \mid X) && (9.39) \\
&= H_2(p_0 f + p_1(1-f)) - H_2(f). && (9.40)
\end{aligned}$$

The only p-dependence is in the first term $H_2(p_0 f + p_1(1-f))$, which is
maximized by setting the argument to 0.5. This value is given by setting
$p_0 = 1/2$.


Solution to exercise 9.13 (p.151). Answer 1. By symmetry, the optimal input
distribution is {0.5,0.5}. The capacity is most easily evaluated by writing the
mutual information as I(X;Y) = H(X) - H(X|Y). The conditional entropy
H(X|Y) is $\sum_y P(y) H(X \mid y)$; when y is known, x is uncertain only if y = ?,
which occurs with probability f/2 + f/2, so the conditional entropy H(X|Y)
is $f H_2(0.5)$.

$$\begin{aligned}
C = I(X;Y) &= H(X) - H(X \mid Y) && (9.41) \\
&= H_2(0.5) - f H_2(0.5) && (9.42) \\
&= 1 - f. && (9.43)
\end{aligned}$$


The binary erasure channel fails a fraction f of the time. Its capacity is 
precisely 1 — f, which is the fraction of the time that the channel is reliable. 
This result seems very reasonable, but it is far from obvious how to encode 
information so as to communicate reliably over this channel. 


Answer 2. Alternatively, without invoking the symmetry assumed above, we
can start from the input ensemble $\{p_0, p_1\}$. The probability that y = ? is
$p_0 f + p_1 f = f$, and when we receive y = ?, the posterior probability of x is
the same as the prior probability, so:

$$\begin{aligned}
I(X;Y) &= H(X) - H(X \mid Y) && (9.44) \\
&= H_2(p_1) - f H_2(p_1) && (9.45) \\
&= (1-f) H_2(p_1). && (9.46)
\end{aligned}$$

This mutual information achieves its maximum value of (1 - f) when $p_1 = 1/2$.

Figure 9.10. (a) The extended

Solution to exercise 9.14 (p.153). The extended channel is shown in
figure 9.10. The best code for this channel with N = 2 is obtained by choosing
two columns that have minimal overlap, for example, columns 00 and 11. The
decoding algorithm returns ‘00’ if the extended channel output is among the
top four and ‘11’ if it’s among the bottom four, and gives up if the output is
‘??’.


Solution to exercise 9.15 (p.155). In example 9.11 (p.151) we showed that the
mutual information between input and output of the Z channel is

$$I(X;Y) = H(Y) - H(Y \mid X) = H_2(p_1(1-f)) - p_1 H_2(f). \tag{9.47}$$

We differentiate this expression with respect to $p_1$, taking care not to confuse
$\log_2$ with $\log_e$:

$$\frac{d}{dp_1} I(X;Y) = (1-f) \log_2 \frac{1 - p_1(1-f)}{p_1(1-f)} - H_2(f). \tag{9.48}$$

Setting this derivative to zero and rearranging using skills developed in
exercise 2.17 (p.36), we obtain:

$$p_1^*(1-f) = \frac{1}{1 + 2^{H_2(f)/(1-f)}}, \tag{9.49}$$

so the optimal input distribution is

$$p_1^* = \frac{1/(1-f)}{1 + 2^{H_2(f)/(1-f)}}. \tag{9.50}$$

As the noise level f tends to 1, this expression tends to 1/e (as you can prove
using L’Hôpital’s rule).

For all values of f, $p_1^*$ is smaller than 1/2. A rough intuition for why input
1 is used less than input 0 is that when input 1 is used, the noisy channel
injects entropy into the received string; whereas when input 0 is used, the
noise has zero entropy.
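Equations (9.47) and (9.50) are easy to check numerically. The sketch below is illustrative only: the function names are invented here, and the closed form is evaluated for noise levels 0 ≤ f < 1.

```python
from math import e, log2


def H2(p):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def z_mutual_information(p1, f):
    """Equation (9.47): I(X;Y) for the Z channel with P(x=1) = p1."""
    return H2(p1 * (1 - f)) - p1 * H2(f)


def z_optimal_p1(f):
    """Equation (9.50): the optimal P(x=1), valid for 0 <= f < 1."""
    return (1.0 / (1 - f)) / (1.0 + 2.0 ** (H2(f) / (1 - f)))


p_star = z_optimal_p1(0.15)
capacity = z_mutual_information(p_star, 0.15)
print(round(p_star, 3), round(capacity, 3))

# Approaching f = 1 the optimum should approach 1/e.
print(z_optimal_p1(1 - 1e-6), 1 / e)
```

For f = 0.15 the optimum sits a little below 1/2, consistent with the remark that input 1 is used less often than input 0.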














Figure 9.11. Capacities of the Z channel, binary symmetric channel, and
binary erasure channel.

Solution to exercise 9.16 (p.155). The capacities of the three channels are
shown in figure 9.11. For any f < 0.5, the BEC is the channel with highest
capacity and the BSC the lowest.
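The ordering of the three capacities can be reproduced with a crude numerical search. This sketch simply maximizes I(X;Y) over a grid of input probabilities; the grid search, the function names, and the dictionary layout are choices made for this illustration rather than a method given in the text.

```python
from math import log2


def mutual_information(p1, Q):
    """I(X;Y) in bits for a binary-input channel, P(x=1) = p1, Q[x][y] = P(y|x)."""
    p_x = [1.0 - p1, p1]
    n_out = len(Q[0])
    p_y = [p_x[0] * Q[0][y] + p_x[1] * Q[1][y] for y in range(n_out)]
    total = 0.0
    for x in range(2):
        for y in range(n_out):
            joint = p_x[x] * Q[x][y]
            if joint > 0.0:
                total += joint * log2(Q[x][y] / p_y[y])
    return total


def capacity(Q, steps=2000):
    """Approximate capacity: best I(X;Y) over p1 = 0, 1/steps, ..., 1."""
    return max(mutual_information(i / steps, Q) for i in range(steps + 1))


def capacities(f):
    """Approximate capacities of the three channels at noise level f."""
    channels = {
        "BSC": [[1 - f, f], [f, 1 - f]],
        "BEC": [[1 - f, f, 0.0], [0.0, f, 1 - f]],  # outputs ordered 0, ?, 1
        "Z": [[1.0, 0.0], [f, 1 - f]],
    }
    return {name: capacity(Q) for name, Q in channels.items()}


for f in (0.05, 0.15, 0.3, 0.45):
    print(f, capacities(f))
```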


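Solution to exercise 9.18 (p.155). The logarithm of the posterior probability
ratio, given y, is

$$a(y) = \ln \frac{P(x=1 \mid y,\alpha,\sigma)}{P(x=-1 \mid y,\alpha,\sigma)} = \ln \frac{Q(y \mid x=1,\alpha,\sigma)}{Q(y \mid x=-1,\alpha,\sigma)} = \frac{2\alpha y}{\sigma^2}. \tag{9.51}$$

Using our skills picked up from exercise 2.17 (p.36), we rewrite this in the
form

$$P(x=1 \mid y,\alpha,\sigma) = \frac{1}{1 + e^{-a(y)}}. \tag{9.52}$$

The optimal decoder selects the most probable hypothesis; this can be done
simply by looking at the sign of a(y). If a(y) > 0 then decode as $\hat{x} = 1$.

The probability of error is

$$p_b = \int_{-\infty}^{0} dy\, Q(y \mid x=1,\alpha,\sigma) = \int_{-\infty}^{-x\alpha} dy\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-y^2/(2\sigma^2)} = \Phi\!\left(-\frac{x\alpha}{\sigma}\right), \tag{9.53}$$

where $\Phi$ is the cumulative distribution function of the standard normal
distribution.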


Random coding 


Solution to exercise 9.20 (p.156). The probability that S = 24 people whose
birthdays are drawn at random from A = 365 days all have distinct birthdays
is

$$\frac{A(A-1)(A-2)\cdots(A-S+1)}{A^S}. \tag{9.54}$$

The probability that two (or more) people share a birthday is one minus this
quantity, which, for S = 24 and A = 365, is about 0.5. This exact way of
answering the question is not very informative since it is not clear for what
value of S the probability changes from being close to 0 to being close to 1.
The number of pairs is S(S - 1)/2, and the probability that a particular
pair shares a birthday is 1/A, so the expected number of collisions is

$$\frac{S(S-1)}{2}\,\frac{1}{A}. \tag{9.55}$$

This answer is more instructive. The expected number of collisions is tiny if
$S \ll \sqrt{A}$ and big if $S \gg \sqrt{A}$.
We can also approximate the probability that all birthdays are distinct,
for small S, thus:

$$\begin{aligned}
\frac{A(A-1)(A-2)\cdots(A-S+1)}{A^S} &= (1)(1 - 1/A)(1 - 2/A)\cdots(1 - (S-1)/A) \\
&\simeq \exp(0)\exp(-1/A)\exp(-2/A)\cdots\exp(-(S-1)/A) && (9.56) \\
&= \exp\left(-\frac{1}{A}\sum_{i=1}^{S-1} i\right) = \exp\left(-\frac{S(S-1)}{2A}\right). && (9.57)
\end{aligned}$$
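The exact expression (9.54), the expected number of collisions (9.55), and the approximation (9.57) can be compared directly. A minimal sketch, with function names chosen for this illustration:

```python
from math import exp


def prob_all_distinct(S, A):
    """Equation (9.54): probability that S birthdays drawn from A days are all distinct."""
    p = 1.0
    for i in range(S):
        p *= (A - i) / A
    return p


def expected_collisions(S, A):
    """Equation (9.55): expected number of pairs sharing a birthday."""
    return S * (S - 1) / (2 * A)


def prob_all_distinct_approx(S, A):
    """Equation (9.57): approximation valid for small S."""
    return exp(-S * (S - 1) / (2 * A))


S, A = 24, 365
print(1 - prob_all_distinct(S, A))     # probability of at least one shared birthday
print(expected_collisions(S, A))       # expected number of matching pairs
print(prob_all_distinct_approx(S, A))  # approximate probability of no match
```

Note that the approximation is simply exp(-expected collisions), which is why the expected-number view is the more instructive of the two.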





About Chapter 10


Before reading Chapter 10, you should have read Chapters 4 and 9.
Exercise 9.14 (p.153) is especially recommended.


Cast of characters


$Q$                the noisy channel
$C$                the capacity of the channel
$X^N$              an ensemble used to create a random code
$\mathcal{C}$      a random code
$N$                the length of the codewords
$\mathbf{x}^{(s)}$ a codeword, the sth in the code
$s$                the number of a chosen codeword (mnemonic: the source
                   selects s)
$S = 2^K$          the total number of codewords in the code
$K = \log_2 S$     the number of bits conveyed by the choice of one codeword
                   from S, assuming it is chosen with uniform probability
$\mathbf{s}$       a binary representation of the number s
$R = K/N$          the rate of the code, in bits per channel use (sometimes called
                   R' instead)
$\hat{s}$          the decoder’s guess of s

10


The Noisy-Channel Coding Theorem 


> 10.1 The theorem 


The theorem has three parts, two positive and one negative. The main positive 
result is the first. 


1. For every discrete memoryless channel, the channel capacity

$$C = \max_{P_X} I(X;Y) \tag{10.1}$$

has the following property. For any ε > 0 and R < C, for large enough N,
there exists a code of length N and rate ≥ R and a decoding algorithm,
such that the maximal probability of block error is < ε.

2. If a probability of bit error $p_b$ is acceptable, rates up to $R(p_b)$ are
achievable, where

$$R(p_b) = \frac{C}{1 - H_2(p_b)}. \tag{10.2}$$

3. For any $p_b$, rates greater than $R(p_b)$ are not achievable.

Figure 10.1. Portion of the R, $p_b$ plane to be proved achievable (1, 2) and
not achievable (3).


»> 10.2 Jointly-typical sequences 


We formalize the intuitive preview of the last chapter. 

We will define codewords $\mathbf{x}^{(s)}$ as coming from an ensemble $X^N$, and
consider the random selection of one codeword and a corresponding channel
output y, thus defining a joint ensemble $(XY)^N$. We will use a typical-set decoder,
which decodes a received signal y as s if $\mathbf{x}^{(s)}$ and y are jointly typical, a term
to be defined shortly.

The proof will then centre on determining the probabilities (a) that the
true input codeword is not jointly typical with the output sequence; and (b)
that a false input codeword is jointly typical with the output. We will show
that, for large N, both probabilities go to zero as long as there are fewer than
$2^{NC}$ codewords, and the ensemble X is the optimal input distribution.


Joint typicality. A pair of sequences x, y of length N are defined to be
jointly typical (to tolerance β) with respect to the distribution P(x, y) if

x is typical of P(x), i.e., $\left| \frac{1}{N} \log \frac{1}{P(\mathbf{x})} - H(X) \right| < \beta$,

y is typical of P(y), i.e., $\left| \frac{1}{N} \log \frac{1}{P(\mathbf{y})} - H(Y) \right| < \beta$,

and x, y is typical of P(x, y), i.e., $\left| \frac{1}{N} \log \frac{1}{P(\mathbf{x},\mathbf{y})} - H(X,Y) \right| < \beta$.

The jointly-typical set $J_{N\beta}$ is the set of all jointly-typical sequence pairs
of length N.


Example. Here is a jointly-typical pair of length N = 100 for the ensemble 
P(x,y) in which P(x) has $(p_0, p_1) = (0.9, 0.1)$ and P(y | x) corresponds to a
binary symmetric channel with noise level 0.2. 


X 1111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 
y  0011111111000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111 


Notice that x has 10 1s, and so is typical of the probability P(x) (at any
tolerance β); and y has 26 1s, so it is typical of P(y) (because P(y = 1) = 0.26);
and x and y differ in 20 bits, which is the typical number of flips for this
channel.
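The three conditions in the definition of joint typicality translate directly into code. The sketch below is an illustration under stated assumptions: the function name and the `Q[x][y] = P(y|x)` layout are invented here, every observed symbol is assumed to have non-zero probability, and the test pair is constructed to have the same counts as the example (10 ones in x, 26 ones in y, 20 differing bits) rather than copied from it.

```python
from math import log2


def jointly_typical(x, y, p_x, Q, beta):
    """True if (x, y) is jointly typical to tolerance beta for P(x)Q(y|x)."""
    N = len(x)
    n_in, n_out = len(p_x), len(Q[0])
    p_y = [sum(p_x[a] * Q[a][b] for a in range(n_in)) for b in range(n_out)]

    # Ensemble entropies H(X), H(Y), H(X,Y) in bits.
    H_X = -sum(p * log2(p) for p in p_x if p > 0)
    H_Y = -sum(p * log2(p) for p in p_y if p > 0)
    H_XY = -sum(p_x[a] * Q[a][b] * log2(p_x[a] * Q[a][b])
                for a in range(n_in) for b in range(n_out)
                if p_x[a] * Q[a][b] > 0)

    # Per-symbol information content (1/N) log 1/P of the observed sequences.
    i_x = -sum(log2(p_x[a]) for a in x) / N
    i_y = -sum(log2(p_y[b]) for b in y) / N
    i_xy = -sum(log2(p_x[a] * Q[a][b]) for a, b in zip(x, y)) / N

    return (abs(i_x - H_X) < beta and abs(i_y - H_Y) < beta
            and abs(i_xy - H_XY) < beta)


p_x = [0.9, 0.1]
bsc = [[0.8, 0.2], [0.2, 0.8]]                  # noise level 0.2
x = [1] * 10 + [0] * 90                         # 10 ones
y = [0] * 2 + [1] * 8 + [0] * 72 + [1] * 18     # 26 ones, 20 flips relative to x

print(jointly_typical(x, y, p_x, bsc, beta=0.01))
print(jointly_typical(x, y[::-1], p_x, bsc, beta=0.05))
```

Reversing y keeps its number of 1s, so it is still typical of P(y), but it now differs from x in only 16 places, so the pair fails the joint condition.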


Joint typicality theorem. Let x, y be drawn from the ensemble $(XY)^N$
defined by

$$P(\mathbf{x},\mathbf{y}) = \prod_{n=1}^{N} P(x_n, y_n).$$

Then

1. the probability that x, y are jointly typical (to tolerance β) tends
to 1 as N → ∞;

2. the number of jointly-typical sequences $|J_{N\beta}|$ is close to $2^{NH(X,Y)}$.
To be precise,

$$|J_{N\beta}| \leq 2^{N(H(X,Y)+\beta)}; \tag{10.3}$$

3. if $\mathbf{x}' \sim X^N$ and $\mathbf{y}' \sim Y^N$, i.e., x' and y' are independent samples
with the same marginal distribution as P(x, y), then the probability
that (x', y') lands in the jointly-typical set is about $2^{-NI(X;Y)}$. To
be precise,

$$P((\mathbf{x}',\mathbf{y}') \in J_{N\beta}) \leq 2^{-N(I(X;Y)-3\beta)}. \tag{10.4}$$


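Proof. The proof of parts 1 and 2 by the law of large numbers follows that
of the source coding theorem in Chapter 4. For part 2, let the pair x, y
play the role of x in the source coding theorem, replacing P(x) there by
the probability distribution P(x, y).

For the third part,

$$\begin{aligned}
P((\mathbf{x}',\mathbf{y}') \in J_{N\beta}) &= \sum_{(\mathbf{x},\mathbf{y}) \in J_{N\beta}} P(\mathbf{x})P(\mathbf{y}) && (10.5) \\
&\leq |J_{N\beta}|\, 2^{-N(H(X)-\beta)}\, 2^{-N(H(Y)-\beta)} && (10.6) \\
&\leq 2^{N(H(X,Y)+\beta) - N(H(X)+H(Y)-2\beta)} && (10.7) \\
&= 2^{-N(I(X;Y)-3\beta)}. \qquad \Box && (10.8)
\end{aligned}$$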














A cartoon of the jointly-typical set is shown in figure 10.2. Two independent
typical vectors are jointly typical with probability

$$P((\mathbf{x}',\mathbf{y}') \in J_{N\beta}) \simeq 2^{-NI(X;Y)} \tag{10.9}$$

because the total number of independent typical pairs is the area of the dashed
rectangle, $2^{NH(X)} 2^{NH(Y)}$, and the number of jointly-typical pairs is roughly
$2^{NH(X,Y)}$, so the probability of hitting a jointly-typical pair is roughly

$$2^{NH(X,Y)} / 2^{NH(X)+NH(Y)} = 2^{-NI(X;Y)}. \tag{10.10}$$

Figure 10.2. The jointly-typical set. The horizontal direction represents
$\mathcal{A}_X^N$, the set of all input strings of length N. The vertical direction
represents $\mathcal{A}_Y^N$, the set of all output strings of length N. The outer box
contains all conceivable input-output pairs. Each dot represents a
jointly-typical pair of sequences (x, y). The total number of jointly-typical
sequences is about $2^{NH(X,Y)}$.


> 10.3 Proof of the noisy-channel coding theorem 


Analogy 


Imagine that we wish to prove that there is a baby in a class of one hundred 
babies who weighs less than 10 kg. Individual babies are difficult to catch and 
weigh. Shannon’s method of solving the task is to scoop up all the babies 
and weigh them all at once on a big weighing machine. If we find that their 
average weight is smaller than 10 kg, there must exist at least one baby who 
weighs less than 10 kg — indeed there must be many! Shannon’s method isn’t 


guaranteed to reveal the existence of an underweight child, since it relies on
there being a tiny number of elephants in the class. But if we use his method
and get a total weight smaller than 1000 kg then our task is solved.

Figure 10.3. Shannon’s method for proving one baby weighs less than 10 kg.


From skinny children to fantastic codes 


We wish to show that there exists a code and a decoder having small prob- 
ability of error. Evaluating the probability of error of any particular coding 
and decoding system is not easy. Shannon’s innovation was this: instead of 
constructing a good coding and decoding system and evaluating its error prob- 
ability, Shannon calculated the average probability of block error of all codes, 
and proved that this average is small. There must then exist individual codes 
that have small probability of block error. 


Random coding and typical-set decoding 


Consider the following encoding-decoding system, whose rate is R’. 


1. We fix P(x) and generate the $S = 2^{NR'}$ codewords of a (N, NR') =
(N, K) code $\mathcal{C}$ at random according to

$$P(\mathbf{x}) = \prod_{n=1}^{N} P(x_n). \tag{10.11}$$

A random code is shown schematically in figure 10.4a.

2. The code is known to both sender and receiver.

3. A message s is chosen from $\{1, 2, \ldots, 2^{NR'}\}$, and $\mathbf{x}^{(s)}$ is transmitted. The
received signal is y, with

$$P(\mathbf{y} \mid \mathbf{x}^{(s)}) = \prod_{n=1}^{N} P(y_n \mid x_n^{(s)}). \tag{10.12}$$

4. The signal is decoded by typical-set decoding.

Typical-set decoding. Decode y as $\hat{s}$ if $(\mathbf{x}^{(\hat{s})}, \mathbf{y})$ are jointly typical and
there is no other s' such that $(\mathbf{x}^{(s')}, \mathbf{y})$ are jointly typical;
otherwise declare a failure ($\hat{s} = 0$).

This is not the optimal decoding algorithm, but it will be good enough,
and easier to analyze. The typical-set decoder is illustrated in
figure 10.4b.

5. A decoding error occurs if $\hat{s} \neq s$.

Figure 10.4. (a) A random code. (b) Example decodings by the typical set
decoder. A sequence that is not jointly typical with any of the codewords,
such as $\mathbf{y}_a$, is decoded as $\hat{s} = 0$. A sequence that is jointly typical with
codeword $\mathbf{x}^{(3)}$ alone, $\mathbf{y}_b$, is decoded as $\hat{s} = 3$. Similarly, $\mathbf{y}_c$ is decoded
as $\hat{s} = 4$. A sequence that is jointly typical with more than one codeword,
such as $\mathbf{y}_d$, is decoded as $\hat{s} = 0$.
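The five-step process above can be simulated for a small case. The following sketch makes several simplifying assumptions that are not in the text: the channel is a binary symmetric channel with noise level f, P(x) is uniform (so x and y are always individually typical and only the joint condition needs testing), a fresh random code is drawn on every trial, and the parameter values are arbitrary. All function names are invented for the illustration.

```python
import random
from math import log2


def H2(p):
    """Binary entropy in bits, for 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


def jointly_typical_bsc(x, y, f, beta):
    """Joint typicality for a BSC with uniform inputs.

    With uniform P(x), (1/N) log 1/P(x) = H(X) = 1 exactly and likewise for y,
    so only |(1/N) log 1/P(x,y) - H(X,Y)| < beta has to be checked.
    """
    n = len(x)
    d = sum(a != b for a, b in zip(x, y))  # number of flipped bits
    info_xy = 1.0 + (d * log2(1 / f) + (n - d) * log2(1 / (1 - f))) / n
    return abs(info_xy - (1.0 + H2(f))) < beta


def typical_set_decode(code, y, f, beta):
    """Return s (1-based) if exactly one codeword is jointly typical with y, else 0."""
    matches = [s for s, cw in enumerate(code, start=1)
               if jointly_typical_bsc(cw, y, f, beta)]
    return matches[0] if len(matches) == 1 else 0


def simulate(n=200, num_codewords=16, f=0.1, beta=0.15, trials=200, seed=0):
    """Estimate the block error probability averaged over random codes."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        code = [[rng.randint(0, 1) for _ in range(n)] for _ in range(num_codewords)]
        s = rng.randint(1, num_codewords)                     # step 3: choose a message
        y = [bit ^ (rng.random() < f) for bit in code[s - 1]]  # pass it through the BSC
        if typical_set_decode(code, y, f, beta) != s:          # steps 4 and 5
            errors += 1
    return errors / trials


print(simulate())
```

Averaging over freshly drawn codes, as this simulation does, estimates the average over all codes rather than the error probability of any one code.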


There are three probabilities of error that we can distinguish. First, there 
is the probability of block error for a particular code C, that is, 


$$p_B(\mathcal{C}) = P(\hat{s} \neq s \mid \mathcal{C}). \tag{10.13}$$

This is a difficult quantity to evaluate for any given code.
Second, there is the average over all codes of this block error probability,

$$\langle p_B \rangle = \sum_{\mathcal{C}} P(\hat{s} \neq s \mid \mathcal{C})\,P(\mathcal{C}). \tag{10.14}$$

Fortunately, this quantity is much easier to evaluate than the first quantity
$P(\hat{s} \neq s \mid \mathcal{C})$. [$\langle p_B \rangle$ is just the probability that there is a decoding
error at step 5 of the five-step process on the previous page.]
Third, the maximal block error probability of a code $\mathcal{C}$,

$$p_{BM}(\mathcal{C}) = \max_s P(\hat{s} \neq s \mid s, \mathcal{C}), \tag{10.15}$$


is the quantity we are most interested in: we wish to show that there exists a 
code C with the required rate whose maximal block error probability is small. 

We will get to this result by first finding the average block error probability, 
$\langle p_B \rangle$. Once we have shown that this can be made smaller than a desired small
number, we immediately deduce that there must exist at least one code C 
whose block error probability is also less than this small number. Finally, 
we show that this code, whose block error probability is satisfactorily small 
but whose maximal block error probability is unknown (and could conceivably 
be enormous), can be modified to make a code of slightly smaller rate whose 
maximal block error probability is also guaranteed to be small. We modify 
the code by throwing away the worst 50% of its codewords. 

We therefore now embark on finding the average probability of block error. 


Probability of error of typical-set decoder 


There are two sources of error when we use typical-set decoding. Either (a)
the output y is not jointly typical with the transmitted codeword $\mathbf{x}^{(s)}$, or (b)
there is some other codeword in $\mathcal{C}$ that is jointly typical with y.

By the symmetry of the code construction, the average probability of error
averaged over all codes does not depend on the selected value of s; we can
assume without loss of generality that s = 1.

(a) The probability that the input $\mathbf{x}^{(1)}$ and the output y are not jointly
typical vanishes, by the joint typicality theorem’s first part (p.163). We give a
name, δ, to the upper bound on this probability, satisfying δ → 0 as N → ∞;
for any desired δ, we can find a blocklength N(δ) such that
$P((\mathbf{x}^{(1)}, \mathbf{y}) \notin J_{N\beta}) \leq \delta$.

(b) The probability that $\mathbf{x}^{(s')}$ and y are jointly typical, for a given s' ≠ 1,
is $\leq 2^{-N(I(X;Y)-3\beta)}$, by part 3. And there are $(2^{NR'} - 1)$ rival values of s' to
worry about.

Thus the average probability of error $\langle p_B \rangle$ satisfies:

$$\begin{aligned}
\langle p_B \rangle &\leq \delta + \sum_{s'=2}^{2^{NR'}} 2^{-N(I(X;Y)-3\beta)} && (10.16) \\
&\leq \delta + 2^{-N(I(X;Y)-R'-3\beta)}. && (10.17)
\end{aligned}$$

The inequality (10.16) that bounds a total probability of error $P_{\rm TOT}$ by the
sum of the probabilities $P_{s'}$ of all sorts of events s' each of which is sufficient
to cause error,

$$P_{\rm TOT} \leq P_1 + P_2 + \cdots,$$

is called a union bound. It is only an equality if the different events that cause
error never occur at the same time as each other.

The average probability of error (10.17) can be made < 2δ by increasing N if

$$R' < I(X;Y) - 3\beta. \tag{10.18}$$
We are almost there. We make three modifications: 


1. We choose P(x) in the proof to be the optimal input distribution of the
channel. Then the condition R' < I(X;Y) - 3β becomes R' < C - 3β.

2. Since the average probability of error over all codes is < 2δ, there must
exist a code with mean probability of block error $p_B(\mathcal{C}) < 2\delta$.

3. To show that not only the average but also the maximal probability of
error, $p_{BM}$, can be made small, we modify this code by throwing away
the worst half of the codewords, the ones most likely to produce errors.
Those that remain must all have conditional probability of error less
than 4δ. We use these remaining codewords to define a new code. This
new code has $2^{NR'-1}$ codewords, i.e., we have reduced the rate from R'
to R' - 1/N (a negligible reduction, if N is large), and achieved $p_{BM} < 4\delta$.
This trick is called expurgation (figure 10.5). The resulting code may
not be the best code of its rate and length, but it is still good enough to
prove the noisy-channel coding theorem, which is what we are trying to
do here.

In conclusion, we can ‘construct’ a code of rate R' - 1/N, where R' < C - 3β,
with maximal probability of error < 4δ. We obtain the theorem as stated by
setting R' = (R + C)/2, δ = ε/4, β < (C - R')/3, and N sufficiently large for
the remaining conditions to hold. The theorem’s first part is thus proved. □
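The claim in step 3, that every surviving codeword has conditional error probability below 4δ, follows from a short averaging argument. It is sketched here for the case of equiprobable messages; the symbol $p_s$ is introduced only for this illustration.

```latex
% p_s = P(\hat{s} \neq s \mid s, \mathcal{C}) is the conditional error probability of
% codeword s, and S = 2^{NR'} is the number of codewords, all equally likely.
p_B(\mathcal{C}) \;=\; \frac{1}{S}\sum_{s=1}^{S} p_s \;<\; 2\delta .

% Suppose, for contradiction, that at least half of the codewords had p_s \geq 4\delta. Then
\frac{1}{S}\sum_{s=1}^{S} p_s \;\geq\; \frac{1}{S}\cdot\frac{S}{2}\cdot 4\delta \;=\; 2\delta ,

% which contradicts p_B(\mathcal{C}) < 2\delta. Hence fewer than S/2 codewords have
% p_s \geq 4\delta, and after discarding the worst S/2 codewords every survivor has
% p_s < 4\delta .
```

Deleting codewords cannot make the typical-set decoder worse for the survivors, since it only removes rival codewords that might have been jointly typical with the output.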





10.4 Communication (with errors) above capacity 


We have proved, for any discrete memoryless channel, the achievability of a 
portion of the R, $p_b$ plane shown in figure 10.6. We have shown that we can
turn any noisy channel into an essentially noiseless binary channel with rate 
up to C bits per cycle. We now extend the right-hand boundary of the region 
of achievability at non-zero error probabilities. [This is called rate-distortion 
theory.] 

We do this with a new trick. Since we know we can make the noisy channel 
into a perfect channel with a smaller rate, it is sufficient to consider commu- 
nication with errors over a noiseless channel. How fast can we communicate 
over a noiseless channel, if we are allowed to make errors? 

Consider a noiseless binary channel, and assume that we force communi- 
cation at a rate greater than its capacity of 1 bit. For example, if we require 
the sender to attempt to communicate at R=2 bits per cycle then he must 
effectively throw away half of the information. What is the best way to do 
this if the aim is to achieve the smallest possible probability of bit error? One 
simple strategy is to communicate a fraction 1/R of the source bits, and ignore 
the rest. The receiver guesses the missing fraction 1 - 1/R at random, and
the average probability of bit error is

$$p_b = \frac{1}{2}\left(1 - 1/R\right). \tag{10.19}$$

The curve corresponding to this strategy is shown by the dashed line in
figure 10.7.

We can do better than this (in terms of minimizing $p_b$) by spreading out
the risk of corruption evenly among all the bits. In fact, we can achieve
$p_b = H_2^{-1}(1 - 1/R)$, which is shown by the solid curve in figure 10.7. So, how
can this optimum be achieved?

Figure 10.5. How expurgation works. (a) In a typical random code, a small
fraction of the codewords are involved in collisions: pairs of codewords are
sufficiently close to each other that the probability of error when either
codeword is transmitted is not tiny. We obtain a new code from a random code
by deleting all these confusable codewords. (b) The resulting code has slightly
fewer codewords, so has a slightly lower rate, and its maximal probability of
error is greatly reduced.

Figure 10.6. Portion of the R, $p_b$ plane proved achievable in the first part of
the theorem. [We’ve proved that the maximal probability of block error $p_{BM}$
can be made arbitrarily small, so the same goes for the bit error probability
$p_b$, which must be smaller than $p_{BM}$.]

We reuse a tool that we just developed, namely the (N, K) code for a
noisy channel, and we turn it on its head, using the decoder to define a lossy
compressor. Specifically, we take an excellent (N, K) code for the binary
symmetric channel. Assume that such a code has a rate R' = K/N, and that
it is capable of correcting errors introduced by a binary symmetric channel
whose transition probability is q. Asymptotically, rate-R' codes exist that
have $R' \simeq 1 - H_2(q)$. Recall that, if we attach one of these capacity-achieving
codes of length N to a binary symmetric channel then (a) the probability
distribution over the outputs is close to uniform, since the entropy of the
output is equal to the entropy of the source (NR') plus the entropy of the
noise ($N H_2(q)$), and (b) the optimal decoder of the code, in this situation,
typically maps a received vector of length N to a transmitted vector differing
in qN bits from the received vector.

We take the signal that we wish to send, and chop it into blocks of length N
(yes, N, not K). We pass each block through the decoder, and obtain a shorter
signal of length K bits, which we communicate over the noiseless channel. To
decode the transmission, we pass the K bit message to the encoder of the
original code. The reconstituted message will now differ from the original
message in some of its bits, typically qN of them. So the probability of bit
error will be $p_b = q$. The rate of this lossy compressor is
$R = N/K = 1/R' = 1/(1 - H_2(p_b))$.
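The two strategies can be compared numerically: the simple one gives $p_b = \frac{1}{2}(1 - 1/R)$, while the lossy-compressor construction gives $p_b = H_2^{-1}(1 - 1/R)$. The sketch below inverts $H_2$ by bisection on [0, 1/2]; the bisection routine and all function names are choices made for this illustration.

```python
from math import log2


def H2(p):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def H2_inverse(h, tol=1e-12):
    """Return p in [0, 1/2] with H2(p) = h; H2 is increasing on that interval."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H2(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pb_simple(R):
    """Bit error probability when a fraction 1 - 1/R of the bits is guessed (R >= 1)."""
    return 0.5 * (1 - 1 / R)


def pb_optimum(R):
    """Bit error probability of the optimum strategy over a noiseless channel (R >= 1)."""
    return H2_inverse(1 - 1 / R)


for R in (1.0, 1.5, 2.0, 2.5):
    print(R, round(pb_simple(R), 3), round(pb_optimum(R), 3))
```

At R = 2, for example, guessing half the bits costs a bit error probability of 0.25, whereas spreading the damage evenly brings it down to about 0.11.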

Now, attaching this lossy compressor to our capacity-C error-free
communicator, we have proved the achievability of communication up to the curve
($p_b$, R) defined by:

$$R = \frac{C}{1 - H_2(p_b)}. \qquad \Box \tag{10.20}$$

For further reading about rate-distortion theory, see Gallager (1968), p. 451,
or McEliece (2002), p. 75.

Figure 10.7. A simple bound on achievable points (R, $p_b$), and
Shannon’s bound.


> 10.5 The non-achievable region (part 3 of the theorem) 


The source, encoder, noisy channel and decoder define a Markov chain:

$$s \rightarrow \mathbf{x} \rightarrow \mathbf{y} \rightarrow \hat{s}$$

$$P(s, \mathbf{x}, \mathbf{y}, \hat{s}) = P(s)\,P(\mathbf{x} \mid s)\,P(\mathbf{y} \mid \mathbf{x})\,P(\hat{s} \mid \mathbf{y}). \tag{10.21}$$

The data processing inequality (exercise 8.9, p.141) must apply to this chain:
$I(s; \hat{s}) \leq I(\mathbf{x}; \mathbf{y})$. Furthermore, by the definition of channel capacity,
$I(\mathbf{x}; \mathbf{y}) \leq NC$, so $I(s; \hat{s}) \leq NC$.

Assume that a system achieves a rate R and a bit error probability $p_b$;
then the mutual information $I(s; \hat{s})$ is $\geq NR(1 - H_2(p_b))$. But $I(s; \hat{s}) > NC$
is not achievable, so $R > \frac{C}{1 - H_2(p_b)}$ is not achievable. □





Exercise 10.1.[3] Fill in the details in the preceding argument. If the bit errors
between $\hat{s}$ and s are independent then we have $I(s; \hat{s}) = NR(1 - H_2(p_b))$.
What if we have complex correlations among those bit errors? Why does
the inequality $I(s; \hat{s}) \geq NR(1 - H_2(p_b))$ hold?


> 10.6 Computing capacity 


We have proved that the capacity of a channel is the maximum rate at which
reliable communication can be achieved. How can we compute the capacity of
a given discrete memoryless channel? We need to find its optimal input
distribution. In general we can find the optimal input distribution by a computer
search, making use of the derivative of the mutual information with respect
to the input probabilities.

[Sections 10.6–10.8 contain advanced material. The first-time reader is
encouraged to skip to section 10.9 (p.172).]


> Exercise 10.2.[2] Find the derivative of I(X;Y) with respect to the input
probability $p_i$, $\partial I(X;Y)/\partial p_i$, for a channel with conditional
probabilities $Q_{j|i}$.


Exercise 10.3.[2] Show that I(X;Y) is a concave ⌢ function of the input
probability vector p.


Since I(X;Y) is concave ⌢ in the input distribution p, any probability
distribution p at which I(X;Y) is stationary must be a global maximum of I(X;Y).
So it is tempting to put the derivative of I(X;Y) into a routine that finds a
local maximum of I(X;Y), that is, an input distribution P(x) such that

$$\frac{\partial I(X;Y)}{\partial p_i} = \lambda \quad \text{for all } i, \tag{10.22}$$

where λ is a Lagrange multiplier associated with the constraint $\sum_i p_i = 1$.
However, this approach may fail to find the right answer, because I(X;Y)
might be maximized by a distribution that has $p_i = 0$ for some inputs. A
simple example is given by the ternary confusion channel.


Ternary confusion channel. $\mathcal{A}_X = \{0, ?, 1\}$. $\mathcal{A}_Y = \{0, 1\}$.

    P(y=0|x=0) = 1;   P(y=0|x=?) = 1/2;   P(y=0|x=1) = 0;
    P(y=1|x=0) = 0;   P(y=1|x=?) = 1/2;   P(y=1|x=1) = 1.

Whenever the input ? is used, the output is random; the other inputs
are reliable inputs. The maximum information rate of 1 bit is achieved
by making no use of the input ?.


> Exercise 10.4.1 P13] Sketch the mutual information for this channel as a 
function of the input distribution p. Pick a convenient two-dimensional 
representation of p. 


The optimization routine must therefore take account of the possibility that,
as we go up hill on I(X;Y), we may run into the inequality constraints $p_i \geq 0$.
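One standard routine that respects the constraints $p_i \geq 0$ automatically is the Blahut–Arimoto algorithm, which updates the input distribution multiplicatively. It is offered here as an illustration and is not the gradient-based search described in this section; the function name, the stopping rule, and the `Q[x][y] = P(y|x)` layout are assumptions of the sketch.

```python
from math import log2


def blahut_arimoto(Q, tol=1e-9, max_iter=10000):
    """Approximate capacity (bits) and optimal input distribution for Q[x][y] = P(y|x)."""
    n_in, n_out = len(Q), len(Q[0])
    p = [1.0 / n_in] * n_in  # start from the uniform input distribution
    lower = 0.0
    for _ in range(max_iter):
        # Output distribution induced by the current input distribution.
        q = [sum(p[x] * Q[x][y] for x in range(n_in)) for y in range(n_out)]
        # c[x] = 2^D, where D is the relative entropy between Q(.|x) and q.
        c = []
        for x in range(n_in):
            d = sum(Q[x][y] * log2(Q[x][y] / q[y])
                    for y in range(n_out) if Q[x][y] > 0)
            c.append(2.0 ** d)
        z = sum(p[x] * c[x] for x in range(n_in))
        lower, upper = log2(z), log2(max(c))  # bounds that bracket the capacity
        p = [p[x] * c[x] / z for x in range(n_in)]  # multiplicative update keeps p >= 0
        if upper - lower < tol:
            break
    return lower, p


# Ternary confusion channel, inputs ordered (0, ?, 1), outputs ordered (0, 1).
ternary = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
cap, p_opt = blahut_arimoto(ternary)
print(cap, p_opt)
```

On the ternary confusion channel the probability assigned to the input ? shrinks towards zero at every iteration, and the estimate approaches the 1 bit quoted above.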


> Exercise 10.5.1% P-!74] Describe the condition, similar to equation (10.22), that 
is satisfied at a point where I(X;Y) is maximized, and describe a
computer program for finding the capacity of a channel.

Results that may help in finding the optimal input distribution


1. All outputs must be used.

2. I(X;Y) is a convex ⌣ function of the channel parameters.

3. There may be several optimal input distributions, but they all look the
same at the output.

[Reminder: The term ‘convex ⌣’ means ‘convex’, and the term ‘concave ⌢’
means ‘concave’; the little smile and frown symbols are included simply to
remind you what convex and concave mean.]


> Exercise 10.6.[2] Prove that no output y is unused by an optimal input
distribution, unless it is unreachable, that is, has Q(y | x) = 0 for all x.

Exercise 10.7.[2] Prove that I(X;Y) is a convex ⌣ function of Q(y | x).


Exercise 10.8.[2] Prove that all optimal input distributions of a channel have
the same output probability distribution $P(y) = \sum_x P(x) Q(y \mid x)$.


These results, along with the fact that I(X;Y) is a concave ⌢ function of
the input probability vector p, prove the validity of the symmetry argument 
that we have used when finding the capacity of symmetric channels. If a 
channel is invariant under a group of symmetry operations — for example, 
interchanging the input symbols and interchanging the output symbols — then, 
given any optimal input distribution that is not symmetric, i.e., is not invariant 
under these operations, we can create another input distribution by averaging 
together this optimal input distribution and all its permuted forms that we 
can make by applying the symmetry operations to the original optimal input 
distribution. The permuted distributions must have the same I(X;Y) as the 
original, by symmetry, so the new input distribution created by averaging 
must have I(X;Y) bigger than or equal to that of the original distribution, 
because of the concavity of I. 


Symmetric channels 


In order to use symmetry arguments, it will help to have a definition of a 
symmetric channel. I like Gallager’s (1968) definition. 


A discrete memoryless channel is a symmetric channel if the set of 
outputs can be partitioned into subsets in such a way that for each 
subset the matrix of transition probabilities has the property that each 
row (if more than 1) is a permutation of each other row and each column 
is a permutation of each other column. 


Example 10.9. This channel

    P(y=0|x=0) = 0.7;   P(y=0|x=1) = 0.1;
    P(y=?|x=0) = 0.2;   P(y=?|x=1) = 0.2;      (10.23)
    P(y=1|x=0) = 0.1;   P(y=1|x=1) = 0.7.

is a symmetric channel because its outputs can be partitioned into (0, 1)
and ?, so that the matrix can be rewritten:

    P(y=0|x=0) = 0.7;   P(y=0|x=1) = 0.1;
    P(y=1|x=0) = 0.1;   P(y=1|x=1) = 0.7;      (10.24)

    P(y=?|x=0) = 0.2;   P(y=?|x=1) = 0.2.

Symmetry is a useful property because, as we will see in a later chapter,
communication at capacity can be achieved over symmetric channels by linear 
codes. 


Exercise 10.10.!#! Prove that for a symmetric channel with any number of 
inputs, the uniform distribution over the inputs is an optimal input 
distribution. 


> Exercise 10.11.17 P-!74] Are there channels that are not symmetric whose op- 
timal input distributions are uniform? Find one, or prove there are 
none. 


> 10.7 Other coding theorems 


The noisy-channel coding theorem that we proved in this chapter is quite
general, applying to any discrete memoryless channel; but it is not very specific.
The theorem only says that reliable communication with error probability ε
and rate R can be achieved by using codes with sufficiently large blocklength
N. The theorem does not say how large N needs to be to achieve given values
of R and ε.

Presumably, the smaller ε is and the closer R is to C, the larger N has to
be.

Figure 10.8. A typical random-coding exponent. [Axes: $E_r(R)$ against R, with
C marked on the R axis.]


Noisy-channel coding theorem — version with explicit N-dependence 


For a discrete memoryless channel, a blocklength N and a rate R,
there exist block codes of length N whose average probability of
error satisfies:

\[ p_B \leq \exp[-N E_r(R)] \tag{10.25} \]

where $E_r(R)$ is the random-coding exponent of the channel, a
convex $\smile$, decreasing, positive function of R for $0 \leq R < C$. The
random-coding exponent is also known as the reliability function.


[By an expurgation argument it can also be shown that there exist
block codes for which the maximal probability of error $p_{BM}$ is also
exponentially small in N.]


The definition of $E_r(R)$ is given in Gallager (1968), p. 139. $E_r(R)$ approaches
zero as $R \to C$; the typical behaviour of this function is illustrated in
figure 10.8. The computation of the random-coding exponent for interesting
channels is a challenging task on which much effort has been expended. Even
for simple channels like the binary symmetric channel, there is no simple
expression for $E_r(R)$.


Lower bounds on the error probability as a function of blocklength 


The theorem stated above asserts that there are codes with $p_B$ smaller than
$\exp[-N E_r(R)]$. But how small can the error probability be? Could it be
much smaller?


For any code with blocklength N on a discrete memoryless channel,
the probability of error assuming all source messages are used with
equal probability satisfies

\[ p_B \gtrsim \exp[-N E_{sp}(R)], \tag{10.26} \]

where the function $E_{sp}(R)$, the sphere-packing exponent of the
channel, is a convex $\smile$, decreasing, positive function of R for
$0 \leq R < C$.


For a precise statement of this result and further references, see Gallager
(1968), p. 157.


> 10.8 Noisy-channel coding theorems and coding practice 


Imagine a customer who wants to buy an error-correcting code and decoder 
for a noisy channel. The results described above allow us to offer the following 
service: if he tells us the properties of his channel, the desired rate R and the 
desired error probability $p_B$, we can, after working out the relevant functions
C, $E_r(R)$, and $E_{sp}(R)$, advise him that there exists a solution to his problem
using a particular blocklength N; indeed that almost any randomly chosen 
code with that blocklength should do the job. Unfortunately we have not 
found out how to implement these encoders and decoders in practice; the cost 
of implementing the encoder and decoder for a random code with large N 
would be exponentially large in N. 

Furthermore, for practical purposes, the customer is unlikely to know ex- 
actly what channel he is dealing with. So Berlekamp (1980) suggests that 
the sensible way to approach error-correction is to design encoding-decoding 
systems and plot their performance on a variety of idealized channels as a 
function of the channel’s noise level. These charts (one of which is illustrated 
on page 568) can then be shown to the customer, who can choose among the 
systems on offer without having to specify what he really thinks his channel 
is like. With this attitude to the practical problem, the importance of the
functions $E_r(R)$ and $E_{sp}(R)$ is diminished.


> 10.9 Further exercises 


Exercise 10.12. A binary erasure channel with input x and output y has
transition probability matrix:

\[ Q = \begin{bmatrix} 1-q & 0 \\ q & q \\ 0 & 1-q \end{bmatrix} \]

Find the mutual information I(X;Y) between the input and output for
general input distribution $\{p_0, p_1\}$, and show that the capacity of this
channel is C = 1 − q bits.


A Z channel has transition probability matrix:

\[ Q = \begin{bmatrix} 1 & q \\ 0 & 1-q \end{bmatrix} \]

Show that, using a (2,1) code, two uses of a Z channel can be made to
emulate one use of an erasure channel, and state the erasure probability
of that erasure channel. Hence show that the capacity of the Z channel,
$C_Z$, satisfies $C_Z \geq \frac{1}{2}(1-q)$ bits.


Explain why the result $C_Z \geq \frac{1}{2}(1-q)$ is an inequality rather than an
equality.


Exercise 10.13.[p.174] A transatlantic cable contains N = 20 indistinguishable
electrical wires. You have the job of figuring out which wire is
which, that is, to create a consistent labelling of the wires at each end. 
Your only tools are the ability to connect wires to each other in groups 
of two or more, and to test for connectedness with a continuity tester. 
What is the smallest number of transatlantic trips you need to make, 
and how do you do it? 


How would you solve the problem for larger N such as N = 1000? 


As an illustration, if N were 3 then the task can be solved in two steps 
by labelling one wire at one end a, connecting the other two together, 
crossing the Atlantic, measuring which two wires are connected, labelling 
them b and c and the unconnected one a, then connecting b to a and 
returning across the Atlantic, whereupon on disconnecting b from c, the 
identities of b and c can be deduced. 


This problem can be solved by persistent search, but the reason it is 
posed in this chapter is that it can also be solved by a greedy approach 
based on maximizing the acquired information. Let the unknown per- 
mutation of wires be x. Having chosen a set of connections of wires C at 
one end, you can then make measurements at the other end, and these 
measurements y convey information about x. How much? And for what 
set of connections is the information that y conveys about x maximized?


»> 10.10 Solutions 


Solution to exercise 10.4 (p.169). If the input distribution is $p = (p_0, p_?, p_1)$,
the mutual information is

\[ I(X;Y) = H(Y) - H(Y|X) = H_2(p_0 + p_?/2) - p_?. \tag{10.27} \]

We can build a good sketch of this function in two ways: by careful inspection
of the function, or by looking at special cases.


For the plots, the two-dimensional representation of p I will use has $p_0$ and
$p_1$ as the independent variables, so that $p = (p_0, p_?, p_1) = (p_0, (1-p_0-p_1), p_1)$.


By inspection. If we use the quantities $p_* \equiv p_0 + p_?/2$ and $p_?$ as our two
degrees of freedom, the mutual information becomes very simple: $I(X;Y) =
H_2(p_*) - p_?$. Converting back to $p_0 = p_* - p_?/2$ and $p_1 = 1 - p_* - p_?/2$,
we obtain the sketch shown at the left below. This function is like a tunnel
rising up the direction of increasing $p_0$ and $p_1$. To obtain the required plot of
I(X;Y) we have to strip away the parts of this tunnel that live outside the
feasible simplex of probabilities; we do this by redrawing the surface, showing
only the parts where $p_0 > 0$ and $p_1 > 0$. A full plot of the function is shown
at the right.

Special cases. In the special case $p_? = 0$, the channel is a noiseless binary
channel, and $I(X;Y) = H_2(p_0)$.

In the special case $p_0 = p_1$, the term $H_2(p_0 + p_?/2)$ is equal to 1, so
$I(X;Y) = 1 - p_?$.

In the special case $p_0 = 0$, the channel is a Z channel with error probability
0.5. We know how to sketch that, from the previous chapter (figure 9.3).

These special cases allow us to construct the skeleton shown in figure 10.9.

[Figure 10.9. Skeleton of the mutual information for the ternary confusion channel.]


Solution to exercise 10.5 (p.169). Necessary and sufficient conditions for p to
maximize I(X;Y) are

\[ \begin{array}{ll}
\dfrac{\partial I(X;Y)}{\partial p_i} = \lambda & \text{and } p_i > 0 \\[2ex]
\dfrac{\partial I(X;Y)}{\partial p_i} \leq \lambda & \text{and } p_i = 0
\end{array} \quad \text{for all } i, \tag{10.28} \]

where $\lambda$ is a constant related to the capacity by $C = \lambda + \log_2 e$.

This result can be used in a computer program that evaluates the deriva- 
tives, and increments and decrements the probabilities p; in proportion to the 
differences between those derivatives. 

This result is also useful for lazy human capacity-finders who are good 
guessers. Having guessed the optimal input distribution, one can simply con- 
firm that equation (10.28) holds. 
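
The guess-and-confirm procedure can be sketched in a few lines. This is an illustrative sketch rather than the book's program: the names `kl_per_input` and `check_optimal` are hypothetical, `Q[j][i]` is assumed to hold P(y=j | x=i), and every output is assumed to have non-zero probability under the guessed distribution. It uses the fact that, in bits, the derivative in (10.28) is $\partial I/\partial p_i = D_i - \log_2 e$, where $D_i = \sum_j Q_{j|i} \log_2 (Q_{j|i}/P(y_j))$, so the condition becomes "$D_i$ equals a common value C wherever $p_i > 0$, and is no larger than C elsewhere".

```python
from math import log2

def kl_per_input(Q, p):
    """Return D_i = sum_j Q[j][i] * log2(Q[j][i] / P(y=j)) for each input i."""
    n_out, n_in = len(Q), len(Q[0])
    py = [sum(Q[j][i] * p[i] for i in range(n_in)) for j in range(n_out)]
    return [
        sum(Q[j][i] * log2(Q[j][i] / py[j]) for j in range(n_out) if Q[j][i] > 0)
        for i in range(n_in)
    ]

def check_optimal(Q, p, tol=1e-9):
    """Check condition (10.28) for a guessed input distribution p.

    Returns (ok, C), where C is the mutual information I(X;Y) at p.
    """
    D = kl_per_input(Q, p)
    C = sum(pi * Di for pi, Di in zip(p, D))
    ok = all(
        abs(Di - C) <= tol if pi > 0 else Di <= C + tol
        for pi, Di in zip(p, D)
    )
    return ok, C

# Binary symmetric channel with f = 0.1: the uniform guess is confirmed.
Q_bsc = [[0.9, 0.1], [0.1, 0.9]]
ok_bsc, C_bsc = check_optimal(Q_bsc, [0.5, 0.5])
```

Because the condition is tested only to within `tol`, a channel whose entries are quoted to four decimal places (such as the one in equation (10.29)) needs a looser tolerance than an exactly symmetric channel.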


Solution to exercise 10.11 (p.171). We certainly expect nonsymmetric channels
with uniform optimal input distributions to exist, since when inventing a
channel we have I(J − 1) degrees of freedom whereas the optimal input
distribution is just (I − 1)-dimensional; so in the I(J − 1)-dimensional space of
perturbations around a symmetric channel, we expect there to be a subspace
of perturbations of dimension I(J − 1) − (I − 1) = I(J − 2) + 1 that leave the
optimal input distribution unchanged.
Here is an explicit example, a bit like a Z channel.

\[ Q = \begin{bmatrix}
0.9585 & 0.0415 & 0.35 & 0.0 \\
0.0415 & 0.9585 & 0.0 & 0.35 \\
0 & 0 & 0.65 & 0 \\
0 & 0 & 0 & 0.65
\end{bmatrix} \tag{10.29} \]


Solution to exercise 10.13 (p.173). The labelling problem can be solved for 
any N > 2 with just two trips, one each way across the Atlantic. 

The key step in the information-theoretic approach to this problem is to
write down the information content of one partition, the combinatorial object
that is the connecting together of subsets of wires. If N wires are grouped
together into $g_1$ subsets of size 1, $g_2$ subsets of size 2, ..., then the number of
such partitions is

\[ \Omega = \frac{N!}{\prod_r (r!)^{g_r} \, g_r!}, \tag{10.30} \]

and the information content of one such partition is the log of this quantity.
In a greedy strategy we choose the first partition to maximize this information
content.

One game we can play is to maximize this information content with respect
to the quantities $g_r$, treated as real numbers, subject to the constraint
$\sum_r g_r r = N$. Introducing a Lagrange multiplier $\lambda$ for the constraint, the
derivative is

\[ \frac{\partial}{\partial g_r} \left( \log \Omega + \lambda \sum_r g_r r \right) = -\log r! - \log g_r + \lambda r, \tag{10.31} \]

which, when set to zero, leads to the rather nice expression

\[ g_r = \frac{e^{\lambda r}}{r!}; \tag{10.32} \]

the optimal $g_r$ is proportional to a Poisson distribution! We can solve for the
Lagrange multiplier by plugging $g_r$ into the constraint $\sum_r g_r r = N$, which
gives the implicit equation

\[ N = \mu e^{\mu}, \tag{10.33} \]

where $\mu \equiv e^{\lambda}$ is a convenient reparameterization of the Lagrange multiplier.


Figure 10.10a shows a graph of $\mu(N)$; figure 10.10b shows the deduced
non-integer assignments $g_r$ when $\mu = 2.2$, and nearby integers $g_r = \{1,2,2,1,1\}$
that motivate setting the first partition to (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst).

This partition produces a random partition at the other end, which has an
information content of $\log \Omega = 40.4$ bits, which is a lot more than half the total
information content we need to acquire to infer the transatlantic permutation,
$\log 20! \simeq 61$ bits. [In contrast, if all the wires are joined together in pairs,
the information content generated is only about 29 bits.] How to choose the
second partition is left to the reader. A Shannonesque approach is appropriate,
picking a random partition at the other end, using the same $\{g_r\}$; you need
to ensure the two partitions are as unlike each other as possible.
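
The numbers quoted above can be reproduced directly from equations (10.30) and (10.33). The sketch below is an illustration only; the function names `log2_num_partitions` and `solve_mu`, and the use of bisection to solve the implicit equation, are choices made here rather than anything specified in the text.

```python
from math import exp, factorial, log2

def log2_num_partitions(g):
    """log2 of Omega in (10.30); g maps subset size r to the count g_r."""
    n = sum(r * gr for r, gr in g.items())
    denom = 1
    for r, gr in g.items():
        denom *= factorial(r) ** gr * factorial(gr)
    return log2(factorial(n) // denom)   # Omega is an exact integer

def solve_mu(n, lo=0.0, hi=10.0, iters=60):
    """Solve N = mu * exp(mu), equation (10.33), by bisection on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid * exp(mid) < n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

first_partition = {1: 1, 2: 2, 3: 2, 4: 1, 5: 1}   # (a)(bc)(de)(fgh)(ijk)(lmno)(pqrst)
all_pairs = {2: 10}                                # twenty wires joined in pairs

bits_first = log2_num_partitions(first_partition)  # about 40.4 bits
bits_pairs = log2_num_partitions(all_pairs)        # about 29 bits
mu_20 = solve_mu(20)                               # about 2.2
```

The default bracket `hi=10.0` covers N up to roughly 200 000, so the same call handles the N = 1000 case posed in the exercise.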

If $N \neq 2$, 5 or 9, then the labelling problem has solutions that are
particularly simple to implement, called Knowlton–Graham partitions:
partition {1,...,N} into disjoint sets in two ways A and B, subject to the
condition that at most one element appears both in an A set of cardinality
j and in a B set of cardinality k, for each j and k (Graham, 1966;
Graham and Knowlton, 1968).

[Figure 10.10. Approximate solution of the cable-labelling problem using
Lagrange multipliers. (a) The parameter $\mu$ as a function of N; the value
$\mu(20) = 2.2$ is highlighted. (b) Non-integer values of the function
$g_r = \mu^r / r!$ are shown by lines and integer values of $g_r$ motivated by
those non-integer values are shown by crosses.]


About Chapter 11 


Before reading Chapter 11, you should have read Chapters 9 and 10. 
You will also need to be familiar with the Gaussian distribution. 


One-dimensional Gaussian distribution. If a random variable y is Gaussian
and has mean $\mu$ and variance $\sigma^2$, which we write:

\[ y \sim \text{Normal}(\mu, \sigma^2), \quad \text{or} \quad P(y) = \text{Normal}(y; \mu, \sigma^2), \tag{11.1} \]

then the distribution of y is:

\[ P(y \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -(y-\mu)^2 / 2\sigma^2 \right]. \tag{11.2} \]

[I use the symbol P for both probability densities and probabilities.]


The inverse-variance $\tau \equiv 1/\sigma^2$ is sometimes called the precision of the
Gaussian distribution.


Multi-dimensional Gaussian distribution. If $y = (y_1, y_2, \ldots, y_N)$ has a
multivariate Gaussian distribution, then

\[ P(y \mid x, A) = \frac{1}{Z(A)} \exp\left( -\frac{1}{2} (y-x)^T A (y-x) \right), \tag{11.3} \]

where x is the mean of the distribution, A is the inverse of the
variance-covariance matrix, and the normalizing constant is
$Z(A) = (\det(A/2\pi))^{-1/2}$.


This distribution has the property that the variance $\Sigma_{ii}$ of $y_i$, and the
covariance $\Sigma_{ij}$ of $y_i$ and $y_j$ are given by

\[ \Sigma_{ij} \equiv E[(y_i - \bar{y}_i)(y_j - \bar{y}_j)] = A^{-1}_{ij}, \tag{11.4} \]

where $A^{-1}$ is the inverse of the matrix A.


The marginal distribution $P(y_i)$ of one component $y_i$ is Gaussian;
the joint marginal distribution of any subset of the components is
multivariate-Gaussian; and the conditional density of any subset, given
the values of another subset, for example, $P(y_i \mid y_j)$, is also Gaussian.


11 


Error-Correcting Codes & Real Channels 


The noisy-channel coding theorem that we have proved shows that there exist 
reliable error-correcting codes for any noisy channel. In this chapter we address 
two questions. 

First, many practical channels have real, rather than discrete, inputs and 
outputs. What can Shannon tell us about these continuous channels? And 
how should digital signals be mapped into analogue waveforms, and vice versa? 

Second, how are practical error-correcting codes made, and what is 
achieved in practice, relative to the possibilities proved by Shannon? 


> 11.1 The Gaussian channel 


The most popular model of a real-input, real-output channel is the Gaussian 
channel. 


The Gaussian channel has a real input x and a real output y. The conditional
distribution of y given x is a Gaussian distribution:

\[ P(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -(y-x)^2 / 2\sigma^2 \right]. \tag{11.5} \]

This channel has a continuous input and output but is discrete in time.
We will show below that certain continuous-time channels are equivalent
to the discrete-time Gaussian channel.


This channel is sometimes called the additive white Gaussian noise
(AWGN) channel.


As with discrete channels, we will discuss what rate of error-free information 
communication can be achieved over this channel. 


Motivation in terms of a continuous-time channel 


Consider a physical (electrical, say) channel with inputs and outputs that are 
continuous in time. We put in x(t), and out comes y(t) = x(t) + n(t). 

Our transmission has a power cost. The average power of a transmission
of length T may be constrained thus:

\[ \int_0^T dt\, [x(t)]^2 / T \leq P. \tag{11.6} \]

The received signal is assumed to differ from x(t) by additive noise n(t) (for
example Johnson noise), which we will model as white Gaussian noise. The
magnitude of this noise is quantified by the noise spectral density, $N_0$.

How could such a channel be used to communicate information? Consider
transmitting a set of N real numbers $\{x_n\}_{n=1}^N$ in a signal of duration T made
up of a weighted combination of orthonormal basis functions $\phi_n(t)$,

\[ x(t) = \sum_{n=1}^{N} x_n \phi_n(t), \tag{11.7} \]

where $\int_0^T dt\, \phi_n(t) \phi_m(t) = \delta_{nm}$. The receiver can then compute the scalars:

\[ y_n \equiv \int_0^T dt\, \phi_n(t) y(t) = x_n + \int_0^T dt\, \phi_n(t) n(t) \tag{11.8} \]

\[ \equiv x_n + n_n \tag{11.9} \]

for n = 1 ... N. If there were no noise, then $y_n$ would equal $x_n$. The white
Gaussian noise n(t) adds scalar noise $n_n$ to the estimate $y_n$. This noise is
Gaussian:

\[ n_n \sim \text{Normal}(0, N_0/2), \tag{11.10} \]

where $N_0$ is the spectral density introduced above. Thus a continuous channel
used in this way is equivalent to the Gaussian channel defined at
equation (11.5). The power constraint $\int_0^T dt\, [x(t)]^2 \leq PT$ defines a constraint on
the signal amplitudes $x_n$,

\[ \sum_n x_n^2 \leq PT \quad \Rightarrow \quad \overline{x_n^2} \leq \frac{PT}{N}. \tag{11.11} \]

[Figure 11.1. Three basis functions, and a weighted combination of them,
$x(t) = \sum_{n=1}^{N} x_n \phi_n(t)$, for a particular choice of $x_1$, $x_2$, and $x_3$.]

Before returning to the Gaussian channel, we define the bandwidth (measured
in Hertz) of the continuous channel to be:

\[ W = \frac{N^{\max}}{2T}, \tag{11.12} \]

where $N^{\max}$ is the maximum number of orthonormal functions that can be
produced in an interval of length T. This definition can be motivated by
imagining creating a band-limited signal of duration T from orthonormal
cosine and sine curves of maximum frequency W. The number of orthonormal
functions is $N^{\max} = 2WT$. This definition relates to the Nyquist sampling
theorem: if the highest frequency present in a signal is W, then the signal
can be fully determined from its values at a series of discrete sample points
separated by the Nyquist interval $\Delta t = 1/2W$ seconds.

So the use of a real continuous channel with bandwidth W, noise spectral
density $N_0$, and power P is equivalent to N/T = 2W uses per second of a
Gaussian channel with noise level $\sigma^2 = N_0/2$ and subject to the signal power
constraint $\overline{x_n^2} \leq P/2W$.


Definition of $E_b/N_0$


Imagine that the Gaussian channel $y_n = x_n + n_n$ is used with an encoding
system to transmit binary source bits at a rate of R bits per channel use. How
can we compare two encoding systems that have different rates of communication
R and that use different powers $\overline{x_n^2}$? Transmitting at a large rate R is
good; using small power is good too.

It is conventional to measure the rate-compensated signal-to-noise ratio by
the ratio of the power per source bit $E_b = \overline{x_n^2}/R$ to the noise spectral density
$N_0$:

\[ E_b/N_0 = \frac{\overline{x_n^2}}{2\sigma^2 R}. \tag{11.13} \]

$E_b/N_0$ is dimensionless, but it is usually reported in the units of
decibels; the value given is $10 \log_{10} E_b/N_0$.


$E_b/N_0$ is one of the measures used to compare coding schemes for Gaussian
channels.
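
As a small worked illustration of equation (11.13) and the decibel convention, the following sketch computes $E_b/N_0$ in dB. The function name `eb_over_n0_db` and its argument names are hypothetical, chosen here for the example.

```python
from math import log10

def eb_over_n0_db(mean_power, sigma2, rate):
    """E_b/N_0 in decibels, using (11.13): mean_power / (2 * sigma2 * rate).

    mean_power is the average of x_n^2, sigma2 the noise variance per
    channel use (N_0 / 2), and rate the number of source bits per use.
    """
    return 10 * log10(mean_power / (2 * sigma2 * rate))

# Unit-power signalling with sigma^2 = 0.5 at rate 1: E_b/N_0 = 1, i.e. 0 dB.
uncoded_db = eb_over_n0_db(1.0, 0.5, 1.0)
# Same power and noise, but a rate-1/2 code: each source bit costs twice the energy.
half_rate_db = eb_over_n0_db(1.0, 0.5, 0.5)
```

Halving the rate at fixed power and noise level raises $E_b/N_0$ by a factor of two, about 3 dB, which is the penalty a rate-1/2 code has to earn back through its error-correcting ability.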


> 11.2 Inferring the input to a real channel


‘The best detection of pulses’


In 1944 Shannon wrote a memorandum (Shannon, 1993) on the problem of
best differentiating between two types of pulses of known shape, represented
by vectors $x_0$ and $x_1$, given that one of them has been transmitted over a
noisy channel. This is a pattern recognition problem. It is assumed that the
noise is Gaussian with probability density

\[ P(n) = \left[ \det\left( \frac{A}{2\pi} \right) \right]^{1/2} \exp\left( -\frac{1}{2} n^T A n \right), \tag{11.14} \]

where A is the inverse of the variance–covariance matrix of the noise, a
symmetric and positive-definite matrix. (If A is a multiple of the identity matrix,
$I/\sigma^2$, then the noise is ‘white’. For more general A, the noise is ‘coloured’.)
The probability of the received vector y given that the source signal was s
(either zero or one) is then

\[ P(y \mid s) = \left[ \det\left( \frac{A}{2\pi} \right) \right]^{1/2} \exp\left( -\frac{1}{2} (y - x_s)^T A (y - x_s) \right). \tag{11.15} \]

[Figure 11.2. Two pulses $x_0$ and $x_1$, represented as 31-dimensional
vectors, and a noisy version of one of them, y.]

The optimal detector is based on the posterior probability ratio:

\[ \frac{P(s=1 \mid y)}{P(s=0 \mid y)} = \frac{P(y \mid s=1)}{P(y \mid s=0)} \, \frac{P(s=1)}{P(s=0)} \tag{11.16} \]

\[ = \exp\left( -\frac{1}{2} (y - x_1)^T A (y - x_1) + \frac{1}{2} (y - x_0)^T A (y - x_0) + \ln \frac{P(s=1)}{P(s=0)} \right) \]

\[ = \exp\left( y^T A (x_1 - x_0) + \theta \right), \tag{11.17} \]

where $\theta$ is a constant independent of the received vector y,

\[ \theta = -\frac{1}{2} x_1^T A x_1 + \frac{1}{2} x_0^T A x_0 + \ln \frac{P(s=1)}{P(s=0)}. \tag{11.18} \]

If the detector is forced to make a decision (i.e., guess either s=1 or s=0) then
the decision that minimizes the probability of error is to guess the most
probable hypothesis. We can write the optimal decision in terms of a discriminant
function:

\[ a(y) \equiv y^T A (x_1 - x_0) + \theta \tag{11.19} \]

with the decisions

\[ \begin{array}{ll}
a(y) > 0 & \rightarrow \text{guess } s=1 \\
a(y) < 0 & \rightarrow \text{guess } s=0 \\
a(y) = 0 & \rightarrow \text{guess either.}
\end{array} \tag{11.20} \]

[Figure 11.3. The weight vector w used to discriminate between $x_0$ and $x_1$.]

Notice that a(y) is a linear function of the received vector,

\[ a(y) = w^T y + \theta, \tag{11.21} \]

where $w \equiv A(x_1 - x_0)$.
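
For the white-noise case, where A is the identity divided by the noise variance, the discriminant reduces to a dot product plus a constant. The sketch below is an illustration under that assumption only (coloured noise would need a full matrix A); the function names `discriminant` and `guess`, and the parameter `prior1` for P(s=1), are hypothetical.

```python
from math import log

def discriminant(y, x0, x1, sigma2, prior1=0.5):
    """a(y) = w.y + theta for white noise, A = I / sigma2 (equations 11.18-11.21)."""
    # w = A (x1 - x0)
    w = [(b - a) / sigma2 for a, b in zip(x0, x1)]
    # theta = -x1.A.x1 / 2 + x0.A.x0 / 2 + ln(P(s=1) / P(s=0))
    theta = (
        -0.5 * sum(b * b for b in x1) / sigma2
        + 0.5 * sum(a * a for a in x0) / sigma2
        + log(prior1 / (1 - prior1))
    )
    return sum(wi * yi for wi, yi in zip(w, y)) + theta

def guess(y, x0, x1, sigma2, prior1=0.5):
    """Return 1 if a(y) > 0, else 0 (ties are resolved as 0 here)."""
    return 1 if discriminant(y, x0, x1, sigma2, prior1) > 0 else 0

pulse0 = [0.0, 0.0, 0.0]
pulse1 = [1.0, 1.0, 1.0]
decision_a = guess([0.9, 1.1, 0.8], pulse0, pulse1, sigma2=1.0)   # near pulse1
decision_b = guess([0.1, -0.2, 0.3], pulse0, pulse1, sigma2=1.0)  # near pulse0
```

With equal priors the log prior ratio vanishes, so a received vector exactly half way between the two pulses gives a(y) = 0; a prior favouring s = 1 shifts that boundary towards $x_0$.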


»> 11.3 Capacity of Gaussian channel 


Until now we have measured the joint, marginal, and conditional entropy 
of discrete variables only. In order to define the information conveyed by 
continuous variables, there are two issues we must address — the infinite length 
of the real line, and the infinite precision of real numbers. 

Infinite inputs


How much information can we convey in one use of a Gaussian channel? If
we are allowed to put any real number x into the Gaussian channel, we could
communicate an enormous string of N digits $d_1 d_2 d_3 \ldots d_N$ by setting
$x = d_1 d_2 d_3 \ldots d_N 000 \ldots 000$. The amount of error-free information conveyed in
just a single transmission could be made arbitrarily large by increasing N,
and the communication could be made arbitrarily reliable by increasing the
number of zeroes at the end of x. There is usually some power cost associated
with large inputs, however, not to mention practical limits in the dynamic
range acceptable to a receiver. It is therefore conventional to introduce a
cost function v(x) for every input x, and constrain codes to have an average
cost $\bar{v}$ less than or equal to some maximum value. A generalized channel
coding theorem, including a cost function for the inputs, can be proved; see
McEliece (1977). The result is a channel capacity $C(\bar{v})$ that is a function of
the permitted cost. For the Gaussian channel we will assume a cost

\[ v(x) = x^2 \tag{11.22} \]

such that the ‘average power’ $\overline{x^2}$ of the input is constrained. We motivated this
cost function above in the case of real electrical channels in which the physical
power consumption is indeed quadratic in x. The constraint $\overline{x^2} = v$ makes
it impossible to communicate infinite information in one use of the Gaussian
channel.


Infinite precision 


It is tempting to define joint, marginal, and conditional entropies for real
variables simply by replacing summations by integrals, but this is not a well
defined operation. As we discretize an interval into smaller and smaller
divisions, the entropy of the discrete distribution diverges (as the logarithm of the
granularity) (figure 11.4). Also, it is not permissible to take the logarithm of
a dimensional quantity such as a probability density P(x) (whose dimensions
are $[x]^{-1}$).
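
The divergence is easy to see numerically. The sketch below is an illustration, not from the original text: it quantizes a zero-mean Gaussian into bins of width g and computes the entropy of the resulting discrete distribution. The function name `discretized_entropy` and the truncation of the real line at `span` standard deviations are assumptions of this example.

```python
from math import erf, log2, sqrt

def discretized_entropy(g, sigma=1.0, span=10.0):
    """Entropy in bits of a zero-mean Gaussian quantized into bins of width g.

    Bins cover roughly [-span*sigma, span*sigma]; the neglected tails carry
    negligible probability for span = 10.
    """
    def cdf(x):
        return 0.5 * (1 + erf(x / (sigma * sqrt(2))))

    n = int(span * sigma / g)
    H = 0.0
    for k in range(-n, n):
        p = cdf((k + 1) * g) - cdf(k * g)   # probability mass of bin k
        if p > 0:
            H -= p * log2(p)
    return H

H_coarse = discretized_entropy(0.1)
H_fine = discretized_entropy(0.05)
```

Halving the bin width from 0.1 to 0.05 raises the entropy by very nearly one bit, and it would keep rising by a bit per halving without limit, which is why the entropy of the density itself is not well defined.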

There is one information measure, however, that has a well-behaved limit,
namely the mutual information, and this is the one that really matters, since
it measures how much information one variable conveys about another. In the
discrete case,

\[ I(X;Y) = \sum_{x,y} P(x,y) \log \frac{P(x,y)}{P(x)P(y)}. \tag{11.23} \]

Now because the argument of the log is a ratio of two probabilities over the
same space, it is OK to have P(x,y), P(x) and P(y) be probability densities
and replace the sum by an integral:

\[ I(X;Y) = \int dx\, dy\, P(x,y) \log \frac{P(x,y)}{P(x)P(y)} \tag{11.24} \]

\[ = \int dx\, dy\, P(x) P(y \mid x) \log \frac{P(y \mid x)}{P(y)}. \tag{11.25} \]

We can now ask these questions for the Gaussian channel: (a) what probability
distribution P(x) maximizes the mutual information (subject to the constraint
$\overline{x^2} = v$)? and (b) does the maximal mutual information still measure the
maximum error-free communication rate of this real channel, as it did for the
discrete channel?

[Figure 11.4. (a) A probability density P(x). Question: can we define the
‘entropy’ of this density? (b) We could evaluate the entropies of a sequence of
probability distributions with decreasing grain-size g, but these entropies
tend to $\int P(x) \log \frac{1}{P(x) g}\, dx$, which is not independent of g: the entropy
goes up by one bit for every halving of g. $\int P(x) \log \frac{1}{P(x)}\, dx$ is an illegal
integral.]

Exercise 11.1.[p.189] Prove that the probability distribution P(x) that
maximizes the mutual information (subject to the constraint $\overline{x^2} = v$) is a
Gaussian distribution of mean zero and variance v.


> Exercise 11.2.[p.189] Show that the mutual information I(X;Y), in the case
of this optimized distribution, is

\[ C = \frac{1}{2} \log\left( 1 + \frac{v}{\sigma^2} \right). \tag{11.26} \]

This is an important result. We see that the capacity of the Gaussian channel
is a function of the signal-to-noise ratio $v/\sigma^2$.


Inferences given a Gaussian input distribution 


If $P(x) = \text{Normal}(x; 0, v)$ and $P(y \mid x) = \text{Normal}(y; x, \sigma^2)$ then the marginal
distribution of y is $P(y) = \text{Normal}(y; 0, v+\sigma^2)$ and the posterior distribution
of the input, given that the output is y, is:

\[ P(x \mid y) \propto P(y \mid x) P(x) \tag{11.27} \]

\[ \propto \exp(-(y-x)^2 / 2\sigma^2) \exp(-x^2 / 2v) \tag{11.28} \]

\[ = \text{Normal}\left( x;\; \frac{v}{v+\sigma^2}\, y,\; \left( \frac{1}{v} + \frac{1}{\sigma^2} \right)^{-1} \right). \tag{11.29} \]

[The step from (11.28) to (11.29) is made by completing the square in the
exponent.] This formula deserves careful study. The mean of the posterior
distribution, $\frac{v}{v+\sigma^2}\, y$, can be viewed as a weighted combination of the value
that best fits the output, x = y, and the value that best fits the prior, x = 0:

\[ \frac{v}{v+\sigma^2}\, y = \frac{1/\sigma^2}{1/v + 1/\sigma^2}\, y + \frac{1/v}{1/v + 1/\sigma^2}\, 0. \tag{11.30} \]

The weights $1/\sigma^2$ and $1/v$ are the precisions of the two Gaussians that we
multiplied together in equation (11.28): the prior and the likelihood.

The precision of the posterior distribution is the sum of these two pre- 
cisions. This is a general property: whenever two independent sources con- 
tribute information, via Gaussian distributions, about an unknown variable, 
the precisions add. [This is the dual to the better-known relationship ‘when 
independent variables are added, their variances add’.] 
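
The precision-weighted form of the posterior translates directly into code. This is a minimal sketch; the function name `gaussian_posterior` is hypothetical, and it assumes the zero-mean prior used in this section.

```python
def gaussian_posterior(y, v, sigma2):
    """Posterior mean and variance of x given y.

    Assumes the prior x ~ Normal(0, v) and the likelihood y ~ Normal(x, sigma2).
    """
    precision = 1.0 / v + 1.0 / sigma2        # the two precisions add
    mean = (1.0 / sigma2) * y / precision     # weighted combination of y and 0
    return mean, 1.0 / precision

post_mean, post_var = gaussian_posterior(y=2.5, v=4.0, sigma2=1.0)
```

For v = 4 and noise variance 1 the posterior precision is 1.25, so the posterior variance is 0.8 and the mean is 0.8 y, in agreement with the factor $v/(v+\sigma^2)$ in the formula for the posterior mean.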


Noisy-channel coding theorem for the Gaussian channel 


We have evaluated a maximal mutual information. Does it correspond to a 
maximum possible rate of error-free information transmission? One way of 
proving that this is so is to define a sequence of discrete channels, all derived 
from the Gaussian channel, with increasing numbers of inputs and outputs, 
and prove that the maximum mutual information of these channels tends to the 
asserted C. The noisy-channel coding theorem for discrete channels applies 
to each of these derived channels, thus we obtain a coding theorem for the 
continuous channel. Alternatively, we can make an intuitive argument for the 
coding theorem specific for the Gaussian channel. 

Geometrical view of the noisy-channel coding theorem: sphere packing


Consider a sequence $x = (x_1, \ldots, x_N)$ of inputs, and the corresponding output
y, as defining two points in an N dimensional space. For large N, the noise
power is very likely to be close (fractionally) to $N\sigma^2$. The output y is therefore
very likely to be close to the surface of a sphere of radius $\sqrt{N\sigma^2}$ centred on x.
Similarly, if the original signal x is generated at random subject to an average
power constraint $\overline{x^2} = v$, then x is likely to lie close to a sphere, centred
on the origin, of radius $\sqrt{Nv}$; and because the total average power of y is
$v + \sigma^2$, the received signal y is likely to lie on the surface of a sphere of radius
$\sqrt{N(v+\sigma^2)}$, centred on the origin.
The volume of an N-dimensional sphere of radius r is

\[ V(r, N) = \frac{\pi^{N/2}}{\Gamma(N/2 + 1)}\, r^N. \tag{11.31} \]

Now consider making a communication system based on non-confusable
inputs x, that is, inputs whose spheres do not overlap significantly. The
maximum number S of non-confusable inputs is given by dividing the volume of
the sphere of probable ys by the volume of the sphere for y given x:

\[ S \leq \left( \frac{\sqrt{N(v+\sigma^2)}}{\sqrt{N\sigma^2}} \right)^{N}. \tag{11.32} \]

Thus the capacity is bounded by:

\[ C = \frac{1}{N} \log S \leq \frac{1}{2} \log\left( 1 + \frac{v}{\sigma^2} \right). \tag{11.33} \]

A more detailed argument like the one used in the previous chapter can
establish equality.


Back to the continuous channel 


Recall that the use of a real continuous channel with bandwidth W, noise
spectral density $N_0$ and power P is equivalent to N/T = 2W uses per second of
a Gaussian channel with $\sigma^2 = N_0/2$ and subject to the constraint $\overline{x_n^2} \leq P/2W$.
Substituting the result for the capacity of the Gaussian channel, we find the
capacity of the continuous channel to be:

\[ C = W \log\left( 1 + \frac{P}{N_0 W} \right) \text{ bits per second.} \tag{11.34} \]

This formula gives insight into the tradeoffs of practical communication.
Imagine that we have a fixed power constraint. What is the best bandwidth to make
use of that power? Introducing $W_0 = P/N_0$, i.e., the bandwidth for which the
signal-to-noise ratio is 1, figure 11.5 shows $C/W_0 = W/W_0 \log(1 + W_0/W)$ as
a function of $W/W_0$. The capacity increases to an asymptote of $W_0 \log e$. It
is dramatically better (in terms of capacity for fixed power) to transmit at a
low signal-to-noise ratio over a large bandwidth, than with high signal-to-noise
in a narrow bandwidth; this is one motivation for wideband communication
methods such as the ‘direct sequence spread-spectrum’ approach used in 3G
mobile phones. Of course, you are not alone, and your electromagnetic
neighbours may not be pleased if you use a large bandwidth, so for social reasons,
engineers often have to make do with higher-power, narrow-bandwidth
transmitters.

[Figure 11.5. Capacity versus bandwidth for a real channel:
$C/W_0 = W/W_0 \log(1 + W_0/W)$ as a function of $W/W_0$.]
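
The bandwidth tradeoff can be explored numerically with equation (11.34). The sketch below is illustrative; the function name `capacity` is hypothetical, and the logarithm is taken to base 2 so that the result is in bits per second.

```python
from math import e, log2

def capacity(W, P, N0):
    """Capacity in bits per second of the continuous channel, equation (11.34)."""
    return W * log2(1 + P / (N0 * W))

# Fix P / N0 = 1, so that W0 = 1, and sweep the bandwidth.
sweep = [capacity(W, 1.0, 1.0) for W in (0.5, 1.0, 2.0, 10.0, 1000.0)]
asymptote = log2(e)   # W0 * log2(e), about 1.44 bits per second
```

At W = W0 the capacity is exactly 1 bit per second; increasing the bandwidth a thousandfold raises it only to about 1.44, so the gain from spreading saturates, but it never reverses.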


> 11.4 What are the capabilities of practical error-correcting codes? 


Nearly all codes are good, but nearly all codes require exponential look-up 
tables for practical implementation of the encoder and decoder — exponential 
in the blocklength N. And the coding theorem required N to be large. 

By a practical error-correcting code, we mean one that can be encoded 
and decoded in a reasonable amount of time, for example, a time that scales 
as a polynomial function of the blocklength N — preferably linearly. 


The Shannon limit is not achieved in practice 


The non-constructive proof of the noisy-channel coding theorem showed that 
good block codes exist for any noisy channel, and indeed that nearly all block 
codes are good. But writing down an explicit and practical encoder and de- 
coder that are as good as promised by Shannon is still an unsolved problem. 


Very good codes. Given a channel, a family of block codes that achieve 
arbitrarily small probability of error at any communication rate up to 
the capacity of the channel are called ‘very good’ codes for that channel. 


Good codes are code families that achieve arbitrarily small probability of 
error at non-zero communication rates up to some maximum rate that 
may be less than the capacity of the given channel. 


Bad codes are code families that cannot achieve arbitrarily small probability 
of error, or that can achieve arbitrarily small probability of error only by 
decreasing the information rate to zero. Repetition codes are an example 
of a bad code family. (Bad codes are not necessarily useless for practical 
purposes.) 


Practical codes are code families that can be encoded and decoded in time 
and space polynomial in the blocklength. 


Most established codes are linear codes 


Let us review the definition of a block code, and then add the definition of a 
linear block code. 


An (N, K) block code for a channel Q is a list of $S = 2^K$ codewords
$\{x^{(1)}, x^{(2)}, \ldots, x^{(2^K)}\}$, each of length N: $x^{(s)} \in \mathcal{A}_X^N$. The signal to be
encoded, s, which comes from an alphabet of size $2^K$, is encoded as $x^{(s)}$.


A linear (N, K) block code is a block code in which the codewords $\{x^{(s)}\}$
make up a K-dimensional subspace of $\mathcal{A}_X^N$. The encoding operation can
be represented by an N × K binary matrix $G^T$ such that if the signal to
be encoded, in binary notation, is s (a vector of length K bits), then the
encoded signal is $t = G^T s$ modulo 2.


The codewords {t} can be defined as the set of vectors satisfying Ht =
0 mod 2, where H is the parity-check matrix of the code.


For example the (7,4) Hamming code of section 1.2 takes K = 4 signal
bits, s, and transmits them followed by three parity-check bits. The N = 7
transmitted symbols are given by $G^T s$ mod 2.

[Margin figure: the generator matrix $G^T$ of the (7,4) Hamming code.]
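
The two descriptions of a linear code, by a generator matrix and by a parity-check matrix, can be checked against each other in a few lines. The sketch below is an illustration: the particular matrices `G_T` and `H` are written out here on the assumption that the parity bits are $t_5 = s_1+s_2+s_3$, $t_6 = s_2+s_3+s_4$ and $t_7 = s_1+s_3+s_4$ (mod 2), which is how section 1.2 is recalled to define the (7,4) Hamming code; the helper names are hypothetical.

```python
from itertools import product

# Assumed generator matrix: four identity rows, then three parity rows.
G_T = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 1, 1],
]

# Matching parity-check matrix H = [P | I_3].
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def matvec_mod2(M, v):
    """Multiply matrix M by vector v, reducing each entry modulo 2."""
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def encode(s):
    """t = G^T s mod 2 for a 4-bit source vector s."""
    return matvec_mod2(G_T, s)

def syndrome(t):
    """H t mod 2; all zeros exactly when t is a codeword."""
    return matvec_mod2(H, t)

codewords = [encode(list(s)) for s in product([0, 1], repeat=4)]
```

Every one of the sixteen codewords produced by `encode` has a zero syndrome, and flipping any single bit of a codeword gives a non-zero syndrome, which is what allows one error per block to be corrected.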
Coding theory was born with the work of Hamming, who invented a fam- ~ |111- 
ily of practical error-correcting codes, each able to correct one error in a cay 


block of length N, of which the repetition code R3 and the (7,4) code are 
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the simplest. Since then most established codes have been generalizations of 
Hamming’s codes: Bose-Chaudhury—Hocquenhem codes, Reed—Miiller codes, 
Reed-Solomon codes, and Goppa codes, to name a few. 
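The two descriptions of a linear code given above (encoding by $t = G^T s$ and checking by $Ht = 0$) can be sketched in a few lines of C for the (7,4) Hamming code. This is only an illustrative sketch: the matrices are assumed to be the systematic ones of section 1.2 (parity bits $t_5 = s_1+s_2+s_3$, $t_6 = s_2+s_3+s_4$, $t_7 = s_1+s_3+s_4$), and the function names and the bit-packing convention are hypothetical choices made here.

```c
#include <stdint.h>

/* Rows of the 7x4 generator matrix G^T, each packed into 4 bits
   (s1 is the most significant bit).  ASSUMPTION: systematic form
   of the (7,4) Hamming code as in section 1.2. */
static const uint8_t GT[7] = {
    0x8, 0x4, 0x2, 0x1,   /* t1..t4 = s1..s4           */
    0xE,                  /* t5 = s1 + s2 + s3 (mod 2) */
    0x7,                  /* t6 = s2 + s3 + s4 (mod 2) */
    0xB                   /* t7 = s1 + s3 + s4 (mod 2) */
};

/* Rows of the 3x7 parity-check matrix H = [P I3], packed into 7 bits
   (t1 is the most significant bit). */
static const uint8_t H[3] = { 0x74, 0x3A, 0x59 };

/* Parity (sum modulo 2) of the low 8 bits of v. */
static unsigned parity(unsigned v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1u;
}

/* t = G^T s mod 2.  Input: s1 in bit 3 ... s4 in bit 0.
   Output: t1 in bit 6 ... t7 in bit 0. */
unsigned hamming74_encode(unsigned s)
{
    unsigned t = 0;
    for (int i = 0; i < 7; i++)
        t = (t << 1) | parity(GT[i] & s);
    return t;
}

/* z = H t mod 2 as a 3-bit number; it is zero for every codeword. */
unsigned hamming74_syndrome(unsigned t)
{
    unsigned z = 0;
    for (int i = 0; i < 3; i++)
        z = (z << 1) | parity(H[i] & t);
    return z;
}
```

With these conventions the source word 1000 is encoded as 1000101, and every one of the 16 codewords has zero syndrome, while a single flipped bit gives a non-zero syndrome.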


Convolutional codes 


Another family of linear codes are convolutional codes, which do not divide 
the source stream into blocks, but instead read and transmit bits continuously. 
The transmitted bits are a linear function of the past source bits. Usually the 
rule for generating the transmitted bits involves feeding the present source 
bit into a linear-feedback shift-register of length k, and transmitting one or 
more linear functions of the state of the shift register at each iteration. The 
resulting transmitted bit stream is the convolution of the source stream with 
a linear filter. The impulse-response function of this filter may have finite or 
infinite duration, depending on the choice of feedback shift-register. 
We will discuss convolutional codes in Chapter 48. 
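As a rough illustration of the shift-register picture, here is a minimal sketch of a rate-1/2 convolutional encoder in C. It makes several assumptions not taken from the text: the register has no feedback (so the impulse response is finite), it holds k = 3 bits, and the two transmitted bits per source bit use the generator taps 111 and 101, a common textbook choice. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Parity (sum modulo 2) of the three low bits of v. */
static uint8_t parity3(unsigned v)
{
    return (uint8_t)((v ^ (v >> 1) ^ (v >> 2)) & 1u);
}

/* Encode n source bits (values 0 or 1 in src) into 2n transmitted bits.
   ASSUMPTION: feedforward register of length 3 with taps 111 and 101. */
void conv_encode(const uint8_t *src, size_t n, uint8_t *out)
{
    unsigned state = 0;   /* bit 2 = newest source bit, bit 0 = oldest */
    for (size_t i = 0; i < n; i++) {
        state = (state >> 1) | ((unsigned)(src[i] & 1u) << 2);
        out[2 * i]     = parity3(state & 7u);   /* taps 111 */
        out[2 * i + 1] = parity3(state & 5u);   /* taps 101 */
    }
}
```

Feeding in a single 1 followed by zeros produces the impulse response 11 10 11 and then zeros; because each output is a sum modulo 2 of source bits, the output for any source stream is the superposition (exclusive-or) of shifted copies of this response.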


Are linear codes ‘good’? 


One might ask whether the Shannon limit is not achieved in practice
because linear codes are inherently not as good as random codes. The answer
is no: the noisy-channel coding theorem can still be proved for linear codes,
at least for some channels (see Chapter 14), though the proofs, like Shannon's
proof for random codes, are non-constructive.

Linear codes are easy to implement at the encoding end. Is decoding a
linear code also easy? Not necessarily. The general decoding problem (find
the maximum likelihood s in the equation $G^T s + n = r$) is in fact NP-complete
(Berlekamp et al., 1978). [NP-complete problems are computational problems
that are all equally difficult and which are widely believed to require
exponential computer time to solve in general.] So attention focuses on families of
codes for which there is a fast decoding algorithm.


Concatenation 


One trick for building codes with practical decoders is the idea of
concatenation.

An encoder-channel-decoder system C → Q → D can be viewed as defining
a super-channel Q' with a smaller probability of error, and with complex
correlations among its errors. We can create an encoder C' and decoder D' for
this super-channel Q'. The code consisting of the outer code C' followed by
the inner code C is known as a concatenated code.

[Margin diagram: C' → C → Q → D → D', in which C → Q → D forms the super-channel Q'.]

Some concatenated codes make use of the idea of interleaving. We read
the data in blocks, the size of each block being larger than the blocklengths
of the constituent codes C and C'. After encoding the data of one block using
code C', the bits are reordered within the block in such a way that nearby
bits are separated from each other once the block is fed to the second code
C. A simple example of an interleaver is a rectangular code or product code
in which the data are arranged in a $K_2 \times K_1$ block, and encoded horizontally
using an $(N_1, K_1)$ linear code, then vertically using a $(N_2, K_2)$ linear code.


> Exercise 11.3.19] Show that either of the two codes can be viewed as the inner 
code or the outer code. 


As an example, figure 11.6 shows a product code in which we encode
first with the repetition code R3 (also known as the Hamming code H(3,1))
horizontally then with H(7,4) vertically. The blocklength of the concatenated
code is 21. The number of source bits per codeword is four, shown by the
small rectangle.

Figure 11.6. A product code. (a) A string 1011 encoded using a
concatenated code consisting of two Hamming codes, H(3,1) and
H(7,4). (b) A noise pattern that flips 5 bits. (c) The received
vector. (d) After decoding using the horizontal (3,1) decoder, and
(e) after subsequently using the vertical (7,4) decoder. The
decoded vector matches the original. (d', e') After decoding in the other
order, three errors still remain.

We can decode conveniently (though not optimally) by using the individual 
decoders for each of the subcodes in some sequence. It makes most sense to 
first decode the code which has the lowest rate and hence the greatest error- 
correcting ability. 

Figure 11.6(c-e) shows what happens if we receive the codeword of fig- 
ure 11.6a with some errors (five bits flipped, as shown) and apply the decoder 
for H(3,1) first, and then the decoder for H(7,4). The first decoder corrects 
three of the errors, but erroneously modifies the third bit in the second row 
where there are two bit errors. The (7,4) decoder can then correct all three 
of these errors. 

Figure 11.6(d'-e') shows what happens if we decode the two codes in the
other order. In columns one and two there are two errors, so the (7,4) decoder
introduces two extra errors. It corrects the one error in column 3. The (3,1)
decoder then cleans up four of the errors, but erroneously infers the second
bit.
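The horizontal-then-vertical decoding order can be sketched in C as follows. This is a minimal sketch under stated assumptions: the 21-bit block is stored as 7 rows of 3 bits, the (7,4) code is taken in the systematic form of section 1.2, the row decoder is a majority vote and the column decoder is a syndrome decoder. All function names are hypothetical, and the noise pattern in the usage note is an invented one, not the pattern of figure 11.6.

```c
#include <stdint.h>

/* Columns of the (7,4) parity-check matrix H as 3-bit numbers; HCOL[i] is
   also the syndrome caused by flipping bit t_{i+1} alone.
   ASSUMPTION: systematic code of section 1.2. */
static const uint8_t HCOL[7] = {5, 6, 7, 3, 4, 2, 1};

/* Systematic encoder; bit 6 of the result is t1, bit 0 is t7. */
static unsigned hamming74_encode(unsigned s)
{
    unsigned s1 = (s >> 3) & 1u, s2 = (s >> 2) & 1u;
    unsigned s3 = (s >> 1) & 1u, s4 = s & 1u;
    unsigned t5 = s1 ^ s2 ^ s3, t6 = s2 ^ s3 ^ s4, t7 = s1 ^ s3 ^ s4;
    return ((s & 0xFu) << 3) | (t5 << 2) | (t6 << 1) | t7;
}

/* Syndrome decoder: corrects at most one flipped bit, returns s1..s4. */
static unsigned hamming74_decode(unsigned r)
{
    unsigned z = 0;
    for (int i = 0; i < 7; i++)
        if ((r >> (6 - i)) & 1u)
            z ^= HCOL[i];
    for (int i = 0; z != 0 && i < 7; i++)
        if (HCOL[i] == z)
            r ^= 1u << (6 - i);
    return (r >> 3) & 0xFu;
}

/* Product encoder: one H(7,4) codeword down the rows, each bit repeated
   three times along its row. */
void product_encode(unsigned s, uint8_t block[7][3])
{
    unsigned t = hamming74_encode(s);
    for (int row = 0; row < 7; row++)
        for (int col = 0; col < 3; col++)
            block[row][col] = (uint8_t)((t >> (6 - row)) & 1u);
}

/* Decode each row by majority vote, then the resulting 7 bits with H(7,4). */
unsigned product_decode(uint8_t block[7][3])
{
    unsigned r = 0;
    for (int row = 0; row < 7; row++) {
        unsigned votes = block[row][0] + block[row][1] + block[row][2];
        r = (r << 1) | (votes >= 2 ? 1u : 0u);
    }
    return hamming74_decode(r);
}
```

For instance, if 1011 is encoded and five bits are flipped so that one row receives two flips and three other rows receive one flip each, the majority vote repairs the three singly-hit rows and gets the doubly-hit row wrong; the (7,4) decoder then corrects that one remaining error.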


Interleaving 


The motivation for interleaving is that by spreading out bits that are nearby 
in one code, we make it possible to ignore the complex correlations among the 
errors that are produced by the inner code. Maybe the inner code will mess 
up an entire codeword; but that codeword is spread out one bit at a time over 
several codewords of the outer code. So we can treat the errors introduced by 
the inner code as if they are independent. 
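One simple way to realize this spreading is a rectangular block interleaver: write the bits into a rows × cols array one row at a time and read them out one column at a time. The sketch below is an assumption-laden illustration (the function names and the one-bit-per-byte layout are choices made here), not a description of any particular system.

```c
#include <stddef.h>
#include <stdint.h>

/* Write row by row, read column by column.  `in` and `out` each hold
   rows*cols bits, one bit per byte. */
void interleave(const uint8_t *in, uint8_t *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[c * rows + r] = in[r * cols + c];
}

/* Inverse permutation: restores the original row-by-row order. */
void deinterleave(const uint8_t *in, uint8_t *out, size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            out[r * cols + c] = in[c * rows + r];
}
```

If each row holds one codeword of the outer code, then any run of at most `rows` consecutive positions in the interleaved stream touches each row at most once, so a burst of that length contributes at most one error to each outer codeword.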


Other channel models 


In addition to the binary symmetric channel and the Gaussian channel, coding 
theorists keep more complex channels in mind also. 

Burst-error channels are important models in practice. Reed-Solomon
codes use Galois fields (see Appendix C.1) with large numbers of elements
(e.g. $2^{16}$) as their input alphabets, and thereby automatically achieve a degree
of burst-error tolerance in that even if 17 successive bits are corrupted, only 2
successive symbols in the Galois field representation are corrupted.
Concatenation and interleaving can give further protection against burst errors. The
concatenated Reed-Solomon codes used on digital compact discs are able to
correct bursts of errors of length 4000 bits.
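The claim about 17 successive bits and 2 symbols is simple arithmetic, which the following C sketch checks. The function name and the convention that bit positions are numbered from 0 with symbol boundaries at multiples of m are assumptions made for the illustration.

```c
/* Number of m-bit symbols touched by a burst of burst_len consecutive
   bits starting at bit position start (burst_len >= 1, m >= 1). */
unsigned symbols_hit(unsigned start, unsigned burst_len, unsigned m)
{
    unsigned first = start / m;                   /* symbol of first bit */
    unsigned last  = (start + burst_len - 1) / m; /* symbol of last bit  */
    return last - first + 1;
}
```

With m = 16, a burst of 17 bits touches exactly 2 symbols wherever it starts, whereas a burst of 18 bits that begins on the last bit of a symbol touches 3.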
> Exercise 11.4.[2, p.189] The technique of interleaving, which allows bursts of
errors to be treated as independent, is widely used, but is theoretically 
a poor way to protect data against burst errors, in terms of the amount 
of redundancy required. Explain why interleaving is a poor method, 
using the following burst-error channel as an example. Time is divided 
into chunks of length N = 100 clock cycles; during each chunk, there 
is a burst with probability b = 0.2; during a burst, the channel is a bi- 
nary symmetric channel with f = 0.5. If there is no burst, the channel 
is an error-free binary channel. Compute the capacity of this channel 
and compare it with the maximum communication rate that could con- 
ceivably be achieved if one used interleaving and treated the errors as 
independent. 


Fading channels are real channels like Gaussian channels except that the 
received power is assumed to vary with time. A moving mobile phone is an 
important example. The incoming radio signal is reflected off nearby objects 
so that there are interference patterns and the intensity of the signal received 
by the phone varies with its location. The received power can easily vary by 
10 decibels (a factor of ten) as the phone’s antenna moves through a distance 
similar to the wavelength of the radio signal (a few centimetres). 


> 11.5 The state of the art 


What are the best known codes for communicating over Gaussian channels? 
All the practical codes are linear codes, and are either based on convolutional 
codes or block codes. 


Convolutional codes, and codes based on them 


Textbook convolutional codes. The ‘de facto standard’ error-correcting 
code for satellite communications is a convolutional code with constraint 
length 7. Convolutional codes are discussed in Chapter 48. 


Concatenated convolutional codes. The above convolutional code can be 
used as the inner code of a concatenated code whose outer code is a Reed- 
Solomon code with eight-bit symbols. This code was used in deep space 
communication systems such as the Voyager spacecraft. For further 
reading about Reed-Solomon codes, see Lin and Costello (1983). 



The code for Galileo. A code using the same format but using a longer
constraint length (15) for its convolutional code and a larger
Reed-Solomon code was developed by the Jet Propulsion Laboratory
(Swanson, 1988). The details of this code are unpublished outside JPL, and the
decoding is only possible using a room full of special-purpose hardware.
In 1992, this was the best code known of rate 1/4.


Turbo codes. In 1993, Berrou, Glavieux and Thitimajshima reported work
on turbo codes. The encoder of a turbo code is based on the encoders
of two convolutional codes. The source bits are fed into each encoder,
the order of the source bits being permuted in a random way, and the
resulting parity bits from each constituent code are transmitted.

The decoding algorithm involves iteratively decoding each constituent
code using its standard decoding algorithm, then using the output of
the decoder as the input to the other decoder. This decoding algorithm
is an instance of a message-passing algorithm called the sum-product
algorithm.

Turbo codes are discussed in Chapter 48, and message passing in
Chapters 16, 17, 25, and 26.

Figure 11.7. The encoder of a turbo code. Each box C1, C2, contains a
convolutional code. The source bits are reordered using a permutation π
before they are fed to C2. The transmitted codeword is obtained by
concatenating or interleaving the outputs of the two convolutional
codes. The random permutation is chosen when the code is designed, and
fixed thereafter.











Block codes 


Gallager's low-density parity-check codes. The best block codes known
for Gaussian channels were invented by Gallager in 1962 but were
promptly forgotten by most of the coding theory community. They were
rediscovered in 1995 and shown to have outstanding theoretical and
practical properties. Like turbo codes, they are decoded by message-passing
algorithms.

We will discuss these beautifully simple codes in Chapter 47.

Figure 11.8. A low-density parity-check matrix and the corresponding
graph of a rate-1/4 low-density parity-check code with blocklength
N = 16, and M = 12 constraints. Each white circle represents a
transmitted bit. Each bit participates in j = 3 constraints, represented
by [+] squares. Each constraint forces the sum of the k = 4 bits to which
it is connected to be even. This code is a (16,4) code. Outstanding
performance is obtained when the blocklength is increased to N ≃ 10 000.

The performances of the above codes are compared for Gaussian channels
in figure 47.17, p.568.


> 11.6 Summary


Random codes are good, but they require exponential resources to encode
and decode them.

Non-random codes tend for the most part not to be as good as random
codes. For a non-random code, encoding may be easy, but even for
simply-defined linear codes, the decoding problem remains very difficult.

The best practical codes (a) employ very large block sizes; (b) are based
on semi-random code constructions; and (c) make use of probability-based
decoding algorithms.


> 11.7 Nonlinear codes 


Most practically used codes are linear, but not all. Digital soundtracks are
encoded onto cinema film as a binary pattern. The likely errors affecting the
film involve dirt and scratches, which produce large numbers of 1s and 0s
respectively. We want none of the codewords to look like all-1s or all-0s, so
that it will be easy to detect errors caused by dirt and scratches. One of the
codes used in digital cinema sound systems is a nonlinear (8,6) code consisting
of 64 of the $\binom{8}{4}$ binary patterns of weight 4.
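The pool of candidate codewords is small enough to enumerate directly. The C sketch below lists all 8-bit patterns of weight 4; the function names are hypothetical, and which 64 of the patterns the cinema code actually uses is not stated in the text, so the sketch makes no attempt to pick them.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 1s in v. */
static int weight(unsigned v)
{
    int w = 0;
    while (v) {
        w += (int)(v & 1u);
        v >>= 1;
    }
    return w;
}

/* Store every 8-bit pattern of weight 4 in out (which must have room for
   at least 70 entries), in increasing numerical order; return the count. */
size_t weight4_patterns(uint8_t *out)
{
    size_t n = 0;
    for (unsigned v = 0; v < 256; v++)
        if (weight(v) == 4)
            out[n++] = (uint8_t)v;
    return n;
}
```

The count is 70, so 64 codewords (6 bits of information) fit with six patterns to spare. A set of weight-4 words cannot be a linear code, because a linear code must contain the all-zero word, which is exactly the kind of pattern this code is designed to exclude.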


> 11.8 Errors other than noise 


Another source of uncertainty for the receiver is uncertainty about the tim- 
ing of the transmitted signal x(t). In ordinary coding theory and infor- 
mation theory, the transmitter’s time t and the receiver’s time u are as- 
sumed to be perfectly synchronized. But if the receiver receives a signal 
y(u), where the receiver’s time, u, is an imperfectly known function u(t) 
of the transmitter’s time t, then the capacity of this channel for commu- 
nication is reduced. The theory of such channels is incomplete, compared 
with the synchronized channels we have discussed thus far. Not even the ca- 
pacity of channels with synchronization errors is known (Levenshtein, 1966; 
Ferreira et al., 1997); codes for reliable communication over channels with 
synchronization errors remain an active research area (Davey and MacKay, 
2001). 
Further reading


For a review of the history of spread-spectrum methods, see Scholtz (1982). 


> 11.9 Exercises 


The Gaussian channel 


> Exercise 11.5.[2, p.190] Consider a Gaussian channel with a real input x, and
signal to noise ratio $v/\sigma^2$.


(a) What is its capacity C?


(b) If the input is constrained to be binary, $x \in \{\pm\sqrt{v}\}$, what is the
capacity C' of this constrained channel?


(c) If in addition the output of the channel is thresholded using the
mapping

$$y \rightarrow y' = \begin{cases} 1 & y > 0 \\ 0 & y \le 0, \end{cases} \qquad (11.35)$$

what is the capacity C'' of the resulting channel?


(d) Plot the three capacities above as a function of $v/\sigma^2$ from 0.1 to 2.
[You'll need to do a numerical integral to evaluate C'.]


> Exercise 11.6.19] For large integers K and N, what fraction of all binary error- 
correcting codes of length N and rate R = K/N are linear codes? [The 
answer will depend on whether you choose to define the code to be an 
ordered list of $2^K$ codewords, that is, a mapping from $s \in \{1, 2, \ldots, 2^K\}$
to $x^{(s)}$, or to define the code to be an unordered list, so that two codes
consisting of the same codewords are identical. Use the latter definition: 
a code is a set of codewords; how the encoder operates is not part of the 
definition of the code.] 


Erasure channels 


> Exercise 11.7.4] Design a code for the binary erasure channel, and a decoding 
algorithm, and evaluate their probability of error. [The design of good 
codes for erasure channels is an active research area (Spielman, 1996; 
Byers et al., 1998); see also Chapter 50.] 


> Exercise 11.8.[5] Design a code for the q-ary erasure channel, whose input x is
drawn from 0, 1, 2, 3, ..., (q-1), and whose output y is equal to x with
probability (1 - f) and equal to ? otherwise. [This erasure channel is a
good model for packets transmitted over the internet, which are either
received reliably or are lost.]


Exercise 11.9.[2, p.190] How do redundant arrays of independent disks (RAID)
work? These are information storage systems consisting of about ten
disk drives, of which any two or three can be disabled and the others are
still able to reconstruct any requested file. What codes are used,
and how far are these systems from the Shannon limit for the problem
they are solving? How would you design a better RAID system? Some
information is provided in the solution section. See
http://www.acnc.com/raid2.html; see also Chapter 50.

[Some people say RAID stands for 'redundant array of inexpensive
disks', but I think that's silly: RAID would still be a good idea
even if the disks were expensive!]
> 11.10 Solutions


Solution to exercise 11.1 (p.181). Introduce a Lagrange multiplier λ for the
power constraint and another, μ, for the constraint of normalization of P(x).

$$F = I(X;Y) - \lambda \int dx\, P(x)\, x^2 - \mu \int dx\, P(x) \qquad (11.36)$$

$$\phantom{F} = \int dx\, P(x) \left[ \int dy\, P(y\,|\,x) \ln \frac{P(y\,|\,x)}{P(y)} - \lambda x^2 - \mu \right]. \qquad (11.37)$$

Make the functional derivative with respect to $P(x^*)$.

$$\frac{\delta F}{\delta P(x^*)} = \int dy\, P(y\,|\,x^*) \ln \frac{P(y\,|\,x^*)}{P(y)} - \lambda x^{*2} - \mu - \int dx\, P(x) \int dy\, P(y\,|\,x) \frac{1}{P(y)} \frac{\delta P(y)}{\delta P(x^*)}. \qquad (11.38)$$

The final factor $\delta P(y)/\delta P(x^*)$ is found, using $P(y) = \int dx\, P(x) P(y\,|\,x)$, to be
$P(y\,|\,x^*)$, and the whole of the last term collapses in a puff of smoke to 1,
which can be absorbed into the μ term.

Substitute $P(y\,|\,x) = \exp(-(y-x)^2/2\sigma^2)/\sqrt{2\pi\sigma^2}$ and set the derivative to
zero:

$$\int dy\, P(y\,|\,x) \ln \frac{P(y\,|\,x)}{P(y)} - \lambda x^2 - \mu' = 0 \qquad (11.39)$$

$$\Rightarrow \int dy\, \frac{\exp(-(y-x)^2/2\sigma^2)}{\sqrt{2\pi\sigma^2}} \ln[P(y)\sigma] = -\lambda x^2 - \mu' - \frac{1}{2}. \qquad (11.40)$$

This condition must be satisfied by $\ln[P(y)\sigma]$ for all x.

Writing a Taylor expansion of $\ln[P(y)\sigma] = a + by + cy^2 + \cdots$, only a quadratic
function $\ln[P(y)\sigma] = a + cy^2$ would satisfy the constraint (11.40). (Any higher
order terms $y^p$, p > 2, would produce terms in $x^p$ that are not present on
the right-hand side.) Therefore P(y) is Gaussian. We can obtain this optimal
output distribution by using a Gaussian input distribution P(x).


Solution to exercise 11.2 (p.181). Given a Gaussian input distribution of
variance v, the output distribution is Normal(0, v + σ²), since x and the noise
are independent random variables, and variances add for independent random
variables. The mutual information is:

$$I(X;Y) = \int dx\, dy\, P(x) P(y\,|\,x) \log P(y\,|\,x) - \int dy\, P(y) \log P(y) \qquad (11.41)$$

$$\phantom{I(X;Y)} = \frac{1}{2} \log \frac{1}{\sigma^2} - \frac{1}{2} \log \frac{1}{v + \sigma^2} \qquad (11.42)$$

$$\phantom{I(X;Y)} = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right). \qquad (11.43)$$


Solution to exercise 11.4 (p.186). The capacity of the channel is one minus 
the information content of the noise that it adds. That information content is, 
per chunk, the entropy of the selection of whether the chunk is bursty, H2(b), 
plus, with probability b, the entropy of the flipped bits, N, which adds up 
to H2(b) + Nb per chunk (roughly; accurate if N is large). So, per bit, the 
capacity is, for N = 100, 


$$C = 1 - \left( \frac{H_2(b)}{N} + b \right) = 1 - 0.207 = 0.793. \qquad (11.44)$$

In contrast, interleaving, which treats bursts of errors as independent, causes
the channel to be treated as a binary symmetric channel with f = 0.2 × 0.5 =
0.1, whose capacity is about 0.53.


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www. inference.phy.cam.ac.uk/mackay/itila/ for links. 


190 11 — Error-Correcting Codes and Real Channels 


Interleaving throws away the useful information about the
correlatedness of the errors. Theoretically, we should be able to communicate about
(0.79/0.53) ≃ 1.5 times faster using a code and decoder that explicitly treat
bursts as bursts.
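As a check on the arithmetic, the two capacities can be written out using the binary entropy function $H_2(p) = p \log_2 (1/p) + (1-p) \log_2 (1/(1-p))$. The labels $C_{\text{burst}}$ and $C_{\text{interleaved}}$ are introduced here only for this worked example, and the figures are rounded to three places.

```latex
\begin{align*}
H_2(0.2) &= 0.2\log_2 5 + 0.8\log_2 1.25 \approx 0.464 + 0.258 = 0.722,\\
C_{\text{burst}} &= 1 - \left(\frac{0.722}{100} + 0.2\right) \approx 0.793,\\
H_2(0.1) &= 0.1\log_2 10 + 0.9\log_2\tfrac{1}{0.9} \approx 0.332 + 0.137 = 0.469,\\
C_{\text{interleaved}} &= 1 - H_2(0.1) \approx 0.531,\\
\frac{C_{\text{burst}}}{C_{\text{interleaved}}} &\approx \frac{0.793}{0.531} \approx 1.49.
\end{align*}
```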


Solution to exercise 11.5 (p.188).


(a) Putting together the results of exercises 11.1 and 11.2, we deduce that
a Gaussian channel with real input x, and signal to noise ratio $v/\sigma^2$ has
capacity

$$C = \frac{1}{2} \log \left( 1 + \frac{v}{\sigma^2} \right). \qquad (11.45)$$

(b) If the input is constrained to be binary, $x \in \{\pm\sqrt{v}\}$, the capacity is
achieved by using these two inputs with equal probability. The capacity
is reduced to a somewhat messy integral,

$$C' = \int_{-\infty}^{\infty} dy\, N(y; 0) \log N(y; 0) - \int_{-\infty}^{\infty} dy\, P(y) \log P(y), \qquad (11.46)$$

where $N(y; x) \equiv (1/\sqrt{2\pi}) \exp[-(y - x)^2/2]$, $x \equiv \sqrt{v}/\sigma$, and
$P(y) \equiv [N(y; x) + N(y; -x)]/2$. This capacity is smaller than the unconstrained
capacity (11.45), but for small signal-to-noise ratio, the two capacities
are close in value.

(c) If the output is thresholded, then the Gaussian channel is turned into
a binary symmetric channel whose transition probability is given by the
error function Φ defined on page 156. The capacity is

$$C'' = 1 - H_2(f), \quad \text{where } f = \Phi(\sqrt{v}/\sigma). \qquad (11.47)$$

Figure 11.9. Capacities (from top to bottom in each graph) C, C',
and C'', versus the signal-to-noise ratio ($\sqrt{v}/\sigma$). The lower graph is
a log-log plot.


Solution to exercise 11.9 (p.188). There are several RAID systems. One of 
the easiest to understand consists of 7 disk drives which store data at rate 
4/7 using a (7,4) Hamming code: each successive four bits are encoded with 
the code and the seven codeword bits are written one to each disk. Two or 
perhaps three disk drives can go down and the others can recover the data. 
The effective channel model here is a binary erasure channel, because it is 
assumed that we can tell when a disk is dead. 

It is not possible to recover the data for some choices of the three dead 
disk drives; can you see why? 


> Exercise 11.10.[2, p.190] Give an example of three disk drives that, if lost, lead
to failure of the above RAID system, and three that can be lost without 
failure. 


Solution to exercise 11.10 (p.190). The (7,4) Hamming code has codewords 
of weight 3. If any set of three disk drives corresponding to one of those code- 
words is lost, then the other four disks can recover only 3 bits of information 
about the four source bits; a fourth bit is lost. [cf. exercise 13.13 (p.220) with 
q = 2: there are no binary MDS codes. This deficit is discussed further in 
section 13.11.] 

Any other set of three disk drives can be lost without problems because 
the corresponding four by four submatrix of the generator matrix is invertible. 
A better code would be a digital fountain — see Chapter 50. 
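The good and bad triples can be enumerated mechanically. The C sketch below rests on a standard fact about erasure decoding of linear codes: a set of erased positions is recoverable exactly when the corresponding columns of the parity-check matrix H are linearly independent. It assumes the systematic (7,4) code of section 1.2, numbers the disks 0 to 6, and uses hypothetical function names.

```c
/* Columns of the (7,4) parity-check matrix H as 3-bit numbers, one per
   disk.  ASSUMPTION: systematic code of section 1.2, disks numbered 0..6. */
static const unsigned HCOL[7] = {5, 6, 7, 3, 4, 2, 1};

/* Three distinct non-zero columns are linearly dependent over GF(2)
   exactly when they sum (exclusive-or) to zero, i.e. when the three
   disks are the support of a weight-3 codeword. */
int triple_recoverable(int a, int b, int c)
{
    return (HCOL[a] ^ HCOL[b] ^ HCOL[c]) != 0;
}

/* Count the sets of three disks whose loss cannot be repaired. */
int count_fatal_triples(void)
{
    int fatal = 0;
    for (int a = 0; a < 7; a++)
        for (int b = a + 1; b < 7; b++)
            for (int c = b + 1; c < 7; c++)
                if (!triple_recoverable(a, b, c))
                    fatal++;
    return fatal;
}
```

Of the 35 possible triples, 7 are fatal, one for each codeword of weight 3. With this numbering, losing disks 0, 1 and 3 is fatal (they carry the codeword 1101000), while losing disks 0, 1 and 2 is not.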
About Chapter 12


In Chapters 1-11, we concentrated on two aspects of information theory and 
coding theory: source coding — the compression of information so as to make 
efficient use of data transmission and storage channels; and channel coding — 
the redundant encoding of information so as to be able to detect and correct 
communication errors. 

In both these areas we started by ignoring practical considerations, concen- 
trating on the question of the theoretical limitations and possibilities of coding. 
We then discussed practical source-coding and channel-coding schemes, shift- 
ing the emphasis towards computational feasibility. But the prime criterion 
for comparing encoding schemes remained the efficiency of the code in terms 
of the channel resources it required: the best source codes were those that 
achieved the greatest compression; the best channel codes were those that 
communicated at the highest rate with a given probability of error. 

In this chapter we now shift our viewpoint a little, thinking of ease of 
information retrieval as a primary goal. It turns out that the random codes 
which were theoretically useful in our study of channel coding are also useful 
for rapid information retrieval. 

Efficient information retrieval is one of the problems that brains seem to 
solve effortlessly, and content-addressable memory is one of the topics we will 
study when we look at neural networks. 
12


Hash Codes: Codes for Efficient 
Information Retrieval 


> 12.1 The information-retrieval problem 


A simple example of an information-retrieval problem is the task of
implementing a phone directory service, which, in response to a person's name,
returns (a) a confirmation that that person is listed in the directory; and (b)
the person's phone number and other details. We could formalize this
problem as follows, with S being the number of names that must be stored in the
directory.

Figure 12.1. Cast of characters.
string length: N ≃ 200
number of strings: S ≃ $2^{23}$
number of possible strings: $2^N$ ≃ $2^{200}$

You are given a list of S binary strings of length N bits, $\{x^{(1)}, \ldots, x^{(S)}\}$,
where S is considerably smaller than the total number of possible strings, $2^N$.
We will call the superscript 's' in $x^{(s)}$ the record number of the string. The
idea is that s runs over customers in the order in which they are added to the
directory and $x^{(s)}$ is the name of customer s. We assume for simplicity that
all people have names of the same length. The name length might be, say,
N = 200 bits, and we might want to store the details of ten million customers,
so S ≃ $10^7$ ≃ $2^{23}$. We will ignore the possibility that two customers have
identical names.

The task is to construct the inverse of the mapping from s to $x^{(s)}$, i.e., to
make a system that, given a string x, returns the value of s such that $x = x^{(s)}$
if one exists, and otherwise reports that no such s exists. (Once we have the
record number, we can go and look in memory location s in a separate memory 
full of phone numbers to find the required number.) The aim, when solving 
this task, is to use minimal computational resources in terms of the amount 
of memory used to store the inverse mapping from x to s and the amount of 
time to compute the inverse mapping. And, preferably, the inverse mapping 
should be implemented in such a way that further new strings can be added 
to the directory in a small amount of computer time too. 


Some standard solutions 


The simplest and dumbest solutions to the information-retrieval problem are 
a look-up table and a raw list. 


The look-up table is a piece of memory of size $2^N \log_2 S$, $\log_2 S$ being the
amount of memory required to store an integer between 1 and S. In
each of the $2^N$ locations, we put a zero, except for the locations x that
correspond to strings $x^{(s)}$, into which we write the value of s.


The look-up table is a simple and quick solution, but only if there is
sufficient memory for the table, and if the cost of looking up entries in
memory is independent of the memory size. But in our definition of the
task, we assumed that N is about 200 bits or more, so the amount of
memory required would be of size $2^{200}$; this solution is completely out
of the question. Bear in mind that the number of particles in the solar
system is only about $2^{190}$.


The raw list is a simple list of ordered pairs $(s, x^{(s)})$ ordered by the value
of s. The mapping from x to s is achieved by searching through the list
of strings, starting from the top, and comparing the incoming string x
with each record $x^{(s)}$ until a match is found. This system is very easy
to maintain, and uses a small amount of memory, about SN bits, but
is rather slow to use, since on average five million pairwise comparisons
will be made.


> Exercise 12.1.[2, p.202] Show that the average time taken to find the required
string in a raw list, assuming that the original names were chosen at 
random, is about S + N binary comparisons. (Note that you don’t 
have to compare the whole string of length N, since a comparison can 
be terminated as soon as a mismatch occurs; show that you need on 
average two binary comparisons per incorrect string match.) Compare 
this with the worst-case search time — assuming that the devil chooses 
the set of strings and the search key. 


The standard way in which phone directories are made improves on the look-up 
table and the raw list by using an alphabetically-ordered list. 


Alphabetical list. The strings $\{x^{(s)}\}$ are sorted into alphabetical order.
Searching for an entry now usually takes less time than was needed
for the raw list because we can take advantage of the sortedness; for
example, we can open the phonebook at its middle page, and compare
the name we find there with the target string; if the target is 'greater'
than the middle string then we know that the required string, if it exists,
will be found in the second half of the alphabetical directory. Otherwise,
we look in the first half. By iterating this splitting-in-the-middle
procedure, we can identify the target string, or establish that the string is not
listed, in $\lceil \log_2 S \rceil$ string comparisons. The expected number of binary
comparisons per string comparison will tend to increase as the search
progresses, but the total number of binary comparisons required will be
no greater than $\lceil \log_2 S \rceil N$.


The amount of memory required is the same as that required for the raw
list.


Adding new strings to the database requires that we insert them in the
correct location in the list. To find that location takes about $\lceil \log_2 S \rceil$
binary comparisons.
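The splitting-in-the-middle search might be sketched in C as follows. The sketch assumes that the names are stored as fixed-length byte strings packed one after another in sorted order, and it returns the position in the sorted list rather than the record number s (a real directory would store s alongside each name). The function name and data layout are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Search a sorted table of n records, each len bytes long, for key.
   Returns the index of the matching record, or -1 if it is not listed. */
long find_sorted(const unsigned char *table, size_t n, size_t len,
                 const unsigned char *key)
{
    size_t lo = 0, hi = n;              /* candidates lie in [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = memcmp(key, table + mid * len, len);
        if (cmp == 0)
            return (long)mid;
        if (cmp > 0)
            lo = mid + 1;               /* target is in the second half */
        else
            hi = mid;                   /* target is in the first half  */
    }
    return -1;
}
```

Each pass through the loop halves the range of candidates, so the loop performs roughly $\log_2 S$ string comparisons at most; `memcmp` stops at the first differing byte, which mirrors the remark that a string comparison can end as soon as a mismatch occurs.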


Can we improve on the well-established alphabetized list? Let us consider 
our task from some new viewpoints. 

The task is to construct a mapping x → s from N bits to $\log_2 S$ bits. This
is a pseudo-invertible mapping, since for any x that maps to a non-zero s, the
customer database contains the pair $(s, x^{(s)})$ that takes us back. Where have
we come across the idea of mapping from N bits to M bits before?

We encountered this idea twice: first, in source coding, we studied block
codes which were mappings from strings of N symbols to a selection of one
label in a list. The task of information retrieval is similar to the task (which
we never actually solved) of making an encoder for a typical-set compression
code.

The second time that we mapped bit strings to bit strings of another 
dimensionality was when we studied channel codes. There, we considered 
codes that mapped from K bits to N bits, with N greater than K, and we 
made theoretical progress using random codes. 

In hash codes, we put together these two notions. We will study random 
codes that map from N bits to M bits where M is smaller than N. 

The idea is that we will map the original high-dimensional space down into 
a lower-dimensional space, one in which it is feasible to implement the dumb 
look-up table method which we rejected a moment ago. 


> 12.2 Hash codes

First we will describe how a hash code works, then we will study the properties
of idealized hash codes. A hash code implements a solution to the
information-retrieval problem, that is, a mapping from x to s, with the help of a
pseudo-random function called a hash function, which maps the N-bit string x to an
M-bit string h(x), where M is smaller than N. M is typically chosen such that
the 'table size' T ≃ $2^M$ is a little bigger than S, say, ten times bigger. For
example, if we were expecting S to be about a million, we might map x into
a 30-bit hash h (regardless of the size N of each item x). The hash function
is some fixed deterministic function which should ideally be indistinguishable
from a fixed random code. For practical purposes, the hash function must be
quick to compute.

Figure 12.2. Revised cast of characters.
string length: N ≃ 200
number of strings: S ≃ $2^{23}$
size of hash function: M ≃ 30 bits
size of hash table: T = $2^M$ ≃ $2^{30}$

Two simple examples of hash functions are: 


Division method. The table size T is a prime number, preferably one that 
is not close to a power of 2. The hash value is the remainder when the 
integer x is divided by T. 


Variable string addition method. This method assumes that x is a string 
of bytes and that the table size T is 256. The characters of x are added, 
modulo 256. This hash function has the defect that it maps strings that 
are anagrams of each other onto the same hash. 


It may be improved by putting the running total through a fixed pseu- 
dorandom permutation after each character is added. In the variable 
string exclusive-or method with table size < 65 536, the string is hashed 
twice in this way, with the initial running total being set to 0 and 1 
respectively (algorithm 12.3). The result is a 16-bit hash. 
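The two simple hash functions can be sketched in C as below. These are illustrative sketches only: the function names are hypothetical, the division method is assumed to read x as a big-endian string of bytes (so that an N-bit integer too large for a machine word can still be reduced modulo T), and the string is assumed to be null-terminated in the addition method.

```c
#include <stddef.h>

/* Division method: remainder of the integer x when divided by T, where x
   is given as len bytes, most significant byte first.  The remainder is
   accumulated byte by byte, so x itself never has to fit in a word. */
unsigned hash_division(const unsigned char *x, size_t len, unsigned T)
{
    unsigned long long r = 0;
    for (size_t i = 0; i < len; i++)
        r = (r * 256u + x[i]) % T;
    return (unsigned)r;
}

/* Variable string addition method: sum of the characters modulo 256. */
unsigned hash_string_add(const char *x)
{
    unsigned sum = 0;
    while (*x)
        sum = (sum + (unsigned char)*x++) & 0xFFu;
    return sum;
}
```

Because addition is commutative, `hash_string_add` gives the same value for "listen" and "silent", which is the anagram defect noted above.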


Having picked a hash function h(x), we implement an information retriever 
as follows. (See figure 12.4.) 


Encoding. A piece of memory called the hash table is created of size $2^M b$
memory units, where b is the amount of memory needed to represent an
integer between 0 and S. This table is initially set to zero throughout.
Each memory $x^{(s)}$ is put through the hash function, and at the location
in the hash table corresponding to the resulting vector $h^{(s)} = h(x^{(s)})$,
the integer s is written, unless that entry in the hash table is already
occupied, in which case we have a collision between $x^{(s)}$ and some earlier
$x^{(s')}$ which both happen to have the same hash code. Collisions can be
handled in various ways (we will discuss some in a moment), but first
let us complete the basic picture.
Algorithm 12.3. C code implementing the variable string exclusive-or method to create a hash h in the range 0...65535 from a string x. Author: Thomas Niemann.

    unsigned char Rand8[256];   // This array contains a random
                                // permutation from 0..255 to 0..255

    int Hash(char *x) {         // x is a pointer to the first char;
        int h;                  // *x is the first character
        unsigned char h1, h2;

        if (*x == 0) return 0;  // Special handling of empty string
        h1 = *x; h2 = *x + 1;   // Initialize two hashes
        x++;                    // Proceed to the next character

        while (*x) {
            // The cast keeps the table index in 0..255 even where char is signed
            h1 = Rand8[h1 ^ (unsigned char)*x];  // Exclusive-or with the two hashes
            h2 = Rand8[h2 ^ (unsigned char)*x];  // and put through the randomizer
            x++;
        }                       // End of string is reached when *x=0
        h = ((int)(h1) << 8) |  // Shift h1 left 8 bits and add h2
            (int) h2;
        return h;               // Hash is concatenation of h1 and h2
    }








Figure 12.4. Use of hash functions for information retrieval. For each string $x^{(s)}$, the hash $h = h(x^{(s)})$ is computed, and the value of s is written into the hth row of the hash table. Blank rows in the hash table contain the value zero. The table size is $T = 2^M$. [The figure shows strings of N bits passed through the hash function to give hashes of M bits, which index rows of the hash table.]

Decoding. To retrieve a piece of information corresponding to a target vector
x, we compute the hash h of x and look at the corresponding location
in the hash table. If there is a zero, then we know immediately that the
string x is not in the database. The cost of this answer is the cost of one
hash-function evaluation and one look-up in the table of size $2^M$. If, on
the other hand, there is a non-zero entry s in the table, there are two
possibilities: either the vector x is indeed equal to $x^{(s)}$; or the vector $x^{(s)}$
is another vector that happens to have the same hash code as the target
x. (A third possibility is that this non-zero entry might have something
to do with our yet-to-be-discussed collision-resolution system.)


To check whether x is indeed equal to $x^{(s)}$, we take the tentative answer
s, look up $x^{(s)}$ in the original forward database, and compare it bit by
bit with x; if it matches then we report s as the desired answer. This
successful retrieval has an overall cost of one hash-function evaluation,
one look-up in the table of size $2^M$, another look-up in a table of size
S, and N binary comparisons, which may be much cheaper than the
simple solutions presented in section 12.1.


> Exercise 12.2. [2, p.202] If we have checked the first few bits of $x^{(s)}$ with x and
found them to be equal, what is the probability that the correct entry 
has been retrieved, if the alternative hypothesis is that x is actually not 
in the database? Assume that the original source strings are random, 
and the hash function is a random hash function. How many binary 
evaluations are needed to be sure with odds of a billion to one that the 
correct entry has been retrieved? 


The hashing method of information retrieval can be used for strings x of 
arbitrary length, if the hash function h(x) can be applied to strings of any 
length. 


> 12.3 Collision resolution 


We will study two ways of resolving collisions: appending in the table, and 
storing elsewhere. 


Appending in table 


When encoding, if a collision occurs, we continue down the hash table and 
write the value of s into the next available location in memory that currently 
contains a zero. If we reach the bottom of the table before encountering a 
zero, we continue from the top. 

When decoding, if we compute the hash code for x and find that the s
contained in the table doesn’t point to an $x^{(s)}$ that matches the cue x, we
continue down the hash table until we either find an s whose $x^{(s)}$ does match
the cue x, in which case we are done, or else encounter a zero, in which case
we know that the cue x is not in the database.

For this method, it is essential that the table be substantially bigger in size
than S. If $2^M < S$ then the encoding rule will become stuck with nowhere to
put the last strings.
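The encoding and decoding rules for appending in the table can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the table size of 16, the byte-sum `toy_hash`, and the names `table_insert` and `table_lookup` are all hypothetical, and the forward database is assumed to be an array `db` indexed by record number s, with index 0 unused because zero marks an empty row.

```c
#include <assert.h>
#include <string.h>

#define TABLE_SIZE 16   /* T; illustrative, must exceed the number of records */

/* Stand-in hash function (an assumption of this sketch): byte sum modulo T. */
static unsigned int toy_hash(const char *x)
{
    unsigned int h = 0;
    for (; *x != '\0'; x++)
        h = (h + (unsigned char)*x) % TABLE_SIZE;
    return h;
}

/* Encoding: write record number s (1-based; 0 means empty) at row h(x), or
   at the next zero row below it, wrapping from the bottom to the top.
   Returns 0 on success, -1 if the table has no free row. */
int table_insert(int table[TABLE_SIZE], const char *x, int s)
{
    unsigned int h = toy_hash(x);
    assert(s > 0);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        if (table[h] == 0) {
            table[h] = s;
            return 0;
        }
        h = (h + 1) % TABLE_SIZE;
    }
    return -1;
}

/* Decoding: walk down from row h(x) until a row points at a string in the
   forward database db that matches x (return its s), or a zero row is met
   (return 0: x is not in the database). */
int table_lookup(const int table[TABLE_SIZE], const char *const db[],
                 const char *x)
{
    unsigned int h = toy_hash(x);
    for (int probes = 0; probes < TABLE_SIZE; probes++) {
        int s = table[h];
        if (s == 0)
            return 0;
        if (strcmp(db[s], x) == 0)
            return s;
        h = (h + 1) % TABLE_SIZE;
    }
    return 0;
}
```

With the strings "listen" and "silent" stored as records 1 and 2, both hash to the same row under this toy hash, so the second is written one row further on (wrapping to the top of the table), and a look-up of "silent" needs two probes.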


Storing elsewhere 


A more robust and flexible method is to use pointers to additional pieces of 
memory in which collided strings are stored. There are many ways of doing 
this. As an example, we could store in location h in the hash table a pointer
(which must be distinguishable from a valid record number s) to a ‘bucket’ 
where all the strings that have hash code h are stored in a sorted list. The 
encoder sorts the strings in each bucket alphabetically as the hash table and 
buckets are created. 

The decoder simply has to go and look in the relevant bucket and then 
check the short list of strings that are there by a brief alphabetical search. 

This method of storing the strings in buckets allows the option of making 
the hash table quite small, which may have practical benefits. We may make it 
so small that almost all strings are involved in collisions, so all buckets contain 
a small number of strings. It only takes a small number of binary comparisons 
to identify which of the strings in the bucket matches the cue x. 


> 12.4 Planning for collisions: a birthday problem


> Exercise 12.3. [2, p.202] If we wish to store S entries using a hash function whose
output has M bits, how many collisions should we expect to happen,
assuming that our hash function is an ideal random function? What
size M of hash table is needed if we would like the expected number of
collisions to be smaller than 1?


What size M of hash table is needed if we would like the expected number 
of collisions to be a small fraction, say 1%, of S? 


[Notice the similarity of this problem to exercise 9.20 (p.156).] 


> 12.5 Other roles for hash codes


Checking arithmetic 


If you wish to check an addition that was done by hand, you may find useful 
the method of casting out nines. In casting out nines, one finds the sum, 
modulo nine, of all the digits of the numbers to be summed and compares 
it with the sum, modulo nine, of the digits of the putative answer. [With a 
little practice, these sums can be computed much more rapidly than the full 
original addition. ] 


Example 12.4. In the calculation shown below, the sum, modulo nine, of
the digits in 189+1254+238 is 7, and the sum, modulo nine, of 1+6+8+1
is 7. The calculation thus passes the casting-out-nines test.

      189
    +1254
    + 238
    -----
     1681

Casting out nines gives a simple example of a hash function. For any
addition expression of the form $a + b + c + \cdots$, where $a, b, c, \ldots$ are decimal
numbers, we define $h \in \{0,1,2,3,4,5,6,7,8\}$ by

    $h(a + b + c + \cdots) = \text{sum modulo nine of all digits in } a, b, c, \ldots$;   (12.1)

then it is a nice property of decimal arithmetic that if

    $a + b + c + \cdots = m + n + o + \cdots$   (12.2)

then the hashes $h(a + b + c + \cdots)$ and $h(m + n + o + \cdots)$ are equal.
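The casting-out-nines hash of equation (12.1) is short enough to write out. The following is a minimal sketch; the function names `digits_mod9` and `casting_out_nines` are hypothetical, and the terms are assumed to be non-negative integers that fit in an `unsigned long`.

```c
#include <assert.h>

/* Sum of the decimal digits of n, reduced modulo nine. */
unsigned int digits_mod9(unsigned long n)
{
    unsigned int sum = 0;
    while (n > 0) {
        sum = (sum + (unsigned int)(n % 10)) % 9;
        n /= 10;
    }
    return sum;
}

/* Hash of the expression terms[0] + terms[1] + ...: the sum, modulo nine,
   of all the digits in all the terms. */
unsigned int casting_out_nines(const unsigned long terms[], int count)
{
    unsigned int h = 0;
    assert(count >= 0);
    for (int i = 0; i < count; i++)
        h = (h + digits_mod9(terms[i])) % 9;
    return h;
}
```

Applied to the terms 189, 1254 and 238 the hash is 7, and `digits_mod9(1681)` is also 7, so the addition passes the test. A wrong answer such as 1691 gives 8 and is caught, whereas the digit transposition 1861 gives 7 and is not.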


> Exercise 12.5. [1, p.203] What evidence does a correct casting-out-nines match
give in favour of the hypothesis that the addition has been done cor- 
rectly? 
Error detection among friends


Are two files the same? If the files are on the same computer, we could just 
compare them bit by bit. But if the two files are on separate machines, it 
would be nice to have a way of confirming that two files are identical without 
having to transfer one of the files from A to B. [And even if we did transfer one 
of the files, we would still like a way to confirm whether it has been received 
without modifications!] 

This problem can be solved using hash codes. Let Alice and Bob be the 
holders of the two files; Alice sent the file to Bob, and they wish to confirm 
it has been received without error. If Alice computes the hash of her file and 
sends it to Bob, and Bob computes the hash of his file, using the same M-bit 
hash function, and the two hashes match, then Bob can deduce that the two 
files are almost surely the same. 


Example 12.6. What is the probability of a false negative, i.e., the probability, 
given that the two files do differ, that the two hashes are nevertheless 
identical? 


If we assume that the hash function is random and that the process that causes
the files to differ knows nothing about the hash function, then the probability
of a false negative is $2^{-M}$. □

A 32-bit hash gives a probability of false negative of about $10^{-10}$. It is
common practice to use a linear hash function called a 32-bit cyclic redundancy
check to detect errors in files. (A cyclic redundancy check is a set of 32 parity-check
bits similar to the 3 parity-check bits of the (7,4) Hamming code.)





To have a false-negative rate smaller than one in a billion, M = 32
bits is plenty, if the errors are produced by noise.





> Exercise 12.7. [2, p.203] Such a simple parity-check code only detects errors; it
doesn’t help correct them. Since error-correcting codes exist, why not 
use one of them to get some error-correcting capability too? 


Tamper detection 


What if the differences between the two files are not simply ‘noise’, but are 
introduced by an adversary, a clever forger called Fiona, who modifies the 
original file to make a forgery that purports to be Alice’s file? How can Alice 
make a digital signature for the file so that Bob can confirm that no-one has 
tampered with the file? And how can we prevent Fiona from listening in on 
Alice’s signature and attaching it to other files? 

Let’s assume that Alice computes a hash function for the file and sends it 
securely to Bob. If Alice computes a simple hash function for the file like the 
linear cyclic redundancy check, and Fiona knows that this is the method of 
verifying the file’s integrity, Fiona can make her chosen modifications to the 
file and then easily identify (by linear algebra) a further 32-or-so single bits 
that, when flipped, restore the hash function of the file to its original value. 
Linear hash functions give no security against forgers. 
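The linearity that Fiona exploits can be seen directly in a small experiment. The sketch below is an illustration under assumptions, not the specific check the text has in mind: it uses the common CRC-32 variant with reflected polynomial 0xEDB88320, initial value 0xFFFFFFFF and final inversion, and the names `crc32_bitwise` and `crc_xor_property_holds` are hypothetical. Because of the initial value and final inversion this variant is affine rather than strictly linear, so the property is stated for three equal-length messages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32: reflected polynomial 0xEDB88320, initial value
   0xFFFFFFFF, final inversion. */
uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* For three messages of equal length (at most 64 bytes here), checks that
   the CRC of their bitwise exclusive-or equals the exclusive-or of their
   CRCs. Returns 1 if the property holds, 0 otherwise. */
int crc_xor_property_holds(const unsigned char *a, const unsigned char *b,
                           const unsigned char *c, size_t len)
{
    unsigned char mixed[64];
    assert(len <= sizeof mixed);
    for (size_t i = 0; i < len; i++)
        mixed[i] = (unsigned char)(a[i] ^ b[i] ^ c[i]);
    return crc32_bitwise(mixed, len)
        == (crc32_bitwise(a, len) ^ crc32_bitwise(b, len) ^ crc32_bitwise(c, len));
}
```

Since the effect of each flipped bit on the check value can be computed independently and combined by exclusive-or, finding a set of bit flips that cancels a chosen modification reduces to solving linear equations over the binary field, which is why such a check offers no protection against a forger.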

We must therefore require that the hash function be hard to invert so that 
no-one can construct a tampering that leaves the hash function unaffected. 
We would still like the hash function to be easy to compute, however, so that 
Bob doesn’t have to do hours of work to verify every file he received. Such 
a hash function — easy to compute, but hard to invert — is called a one-way 
hash function. Finding such functions is one of the active research areas of
cryptography. 

A hash function that is widely used in the free software community to
confirm that two files do not differ is MD5, which produces a 128-bit hash. The
details of how it works are quite complicated, involving convoluted exclusive-or-ing
and if-ing and and-ing.¹

Even with a good one-way hash function, the digital signatures described 
above are still vulnerable to attack, if Fiona has access to the hash function. 
Fiona could take the tampered file and hunt for a further tiny modification to 
it such that its hash matches the original hash of Alice’s file. This would take
some time (on average, about $2^{32}$ attempts, if the hash function has 32 bits),
but eventually Fiona would find a tampered file that matches the given hash.
To be secure against forgery, digital signatures must either have enough bits 
for such a random search to take too long, or the hash function itself must be 
kept secret. 


Fiona has to hash $2^M$ files to cheat. $2^{32}$ file modifications is not
very many, so a 32-bit hash function is not large enough for forgery
prevention.





Another person who might have a motivation for forgery is Alice herself. 
For example, she might be making a bet on the outcome of a race, without 
wishing to broadcast her prediction publicly; a method for placing bets would 
be for her to send to Bob the bookie the hash of her bet. Later on, she could 
send Bob the details of her bet. Everyone can confirm that her bet is consis- 
tent with the previously publicized hash. [This method of secret publication 
was used by Isaac Newton and Robert Hooke when they wished to establish 
priority for scientific ideas without revealing them. Hooke’s hash function 
was alphabetization as illustrated by the conversion of UT TENSIO, SIC VIS 
into the anagram CEIIINOSSSTTUV.] Such a protocol relies on the assumption 
that Alice cannot change her bet after the event without the hash coming 
out wrong. How big a hash function do we need to use to ensure that Alice 
cannot cheat? The answer is different from the size of the hash we needed in 
order to defeat Fiona above, because Alice is the author of both files. Alice 
could cheat by searching for two files that have identical hashes to each other. 
For example, if she’d like to cheat by placing two bets for the price of one, 
she could make a large number $N_1$ of versions of bet one (differing from each
other in minor details only), and a large number $N_2$ of versions of bet two, and
hash them all. If there’s a collision between the hashes of two bets of different 
types, then she can submit the common hash and thus buy herself the option 
of placing either bet. 


Example 12.8. If the hash has M bits, how big do $N_1$ and $N_2$ need to be for
Alice to have a good chance of finding two different bets with the same
hash?

This is a birthday problem like exercise 9.20 (p.156). If there are $N_1$ Montagues
and $N_2$ Capulets at a party, and each is assigned a ‘birthday’ of M bits, the
expected number of collisions between a Montague and a Capulet is

    $N_1 N_2 2^{-M}$,   (12.3)

so to minimize the number of files hashed, $N_1 + N_2$, Alice should make $N_1$
and $N_2$ equal, and will need to hash about $2^{M/2}$ files until she finds two that
match. □

¹ http://www.freesoft.org/CIE/RFC/1321/3.htm





Alice has to hash $2^{M/2}$ files to cheat. [This is the square root of the
number of hashes Fiona had to make.]





If Alice has the use of $C = 10^6$ computers for T = 10 years, each computer
taking t = 1 ns to evaluate a hash, the bet-communication system is secure
against Alice’s dishonesty only if $M \gg 2 \log_2 (CT/t) \simeq 160$ bits.
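The figure of roughly 160 bits can be checked by direct arithmetic. The values used below are assumptions of this check: $C = 10^6$ computers, and a year taken as about $3.15 \times 10^7$ seconds.

```latex
\begin{align*}
\frac{CT}{t} &\simeq \frac{10^{6} \times 10 \times 3.15 \times 10^{7}\,\mathrm{s}}{10^{-9}\,\mathrm{s}}
              \simeq 3 \times 10^{23} \ \text{hash evaluations}, \\
\log_2 \frac{CT}{t} &\simeq 78, \\
2 \log_2 \frac{CT}{t} &\simeq 156 \simeq 160 \ \text{bits}.
\end{align*}
```

The factor of two appears because Alice needs only about $2^{M/2}$ hash evaluations to find a collision, so her total budget $CT/t$ must stay well below $2^{M/2}$.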


Further reading 


The Bible for hash codes is volume 3 of Knuth (1968). I highly recommend the
story of Doug McIlroy’s spell program, as told in section 13.8 of Programming
Pearls (Bentley, 2000). This astonishing piece of software makes use of a 64-kilobyte
data structure to store the spellings of all the words of a 75 000-word
dictionary.


> 12.6 Further exercises 


> Exercise 12.9. [1] What is the shortest the address on a typical international
letter could be, if it is to get to a unique human recipient? (Assume
the permitted characters are [A-Z,0-9].) How long are typical email
addresses?


> Exercise 12.10. [2, p.203] How long does a piece of text need to be for you to be
pretty sure that no human has written that string of characters before?
How many notes are there in a new melody that has not been composed
before?


> Exercise 12.11. [3, p.204] Pattern recognition by molecules.


Some proteins produced in a cell have a regulatory role. A regulatory 


protein controls the transcription of specific genes in the genome. This 
control often involves the protein’s binding to a particular DNA sequence 
in the vicinity of the regulated gene. The presence of the bound protein 
either promotes or inhibits transcription of the gene. 


(a) Use information-theoretic arguments to obtain a lower bound on
the size of a typical protein that acts as a regulator specific to one
gene in the whole human genome. Assume that the genome is a
sequence of $3 \times 10^9$ nucleotides drawn from a four letter alphabet
{A,C,G,T}; a protein is a sequence of amino acids drawn from a
twenty letter alphabet. [Hint: establish how long the recognized
DNA sequence has to be in order for that sequence to be unique
to the vicinity of one gene, treating the rest of the genome as a
random sequence. Then discuss how big the protein must be to
recognize a sequence of that length uniquely.]

(b) Some of the sequences recognized by DNA-binding regulatory proteins
consist of a subsequence that is repeated twice or more, for
example the sequence

    GCCCCCCACCCCTGCCCCC   (12.4)

is a binding site found upstream of the alpha-actin gene in humans.
Does the fact that some binding sites consist of a repeated subsequence
influence your answer to part (a)?


> 12.7 Solutions 


Solution to exercise 12.1 (p.194). First imagine comparing the string x with 
another random string $x^{(s)}$. The probability that the first bits of the two
strings match is 1/2. The probability that the second bits match is 1/2. As- 
suming we stop comparing once we hit the first mismatch, the expected number 
of matches is 1, so the expected number of comparisons is 2 (exercise 2.34, 
p.38). 

Assuming the correct string is located at random in the raw list, we will 
have to compare with an average of S/2 strings before we find it, which costs 
2S/2 binary comparisons; and comparing the correct strings takes N binary
comparisons, giving a total expectation of S + N binary comparisons, if the 
strings are chosen at random. 

In the worst case (which may indeed happen in practice), the other strings 
are very similar to the search key, so that a lengthy sequence of comparisons 
is needed to find each mismatch. The worst case is when the correct string 
is last in the list, and all the other strings differ in the last bit only, giving a 
requirement of SN binary comparisons. 


Solution to exercise 12.2 (p.197). The likelihood ratio for the two hypotheses,
$H_0$: $x^{(s)} = x$, and $H_1$: $x^{(s)} \neq x$, contributed by the datum ‘the first bits of
$x^{(s)}$ and x are equal’ is

    $\frac{P(\text{Datum} \mid H_0)}{P(\text{Datum} \mid H_1)} = \frac{1}{1/2} = 2.$   (12.5)

If the first r bits all match, the likelihood ratio is $2^r$ to one. On finding that
30 bits match, the odds are a billion to one in favour of $H_0$, assuming we start
from even odds. [For a complete answer, we should compute the evidence
given by the prior information that the hash entry s has been found in the
table at h(x). This fact gives further evidence in favour of $H_0$.]


Solution to exercise 12.3 (p.198). Let the hash function have an output alphabet
of size $T = 2^M$. If M were equal to $\log_2 S$ then we would have exactly
enough bits for each entry to have its own unique hash. The probability that
one particular pair of entries collide under a random hash function is 1/T. The
number of pairs is S(S − 1)/2. So the expected number of collisions between
pairs is exactly

    $S(S - 1)/(2T)$.   (12.6)

If we would like this to be smaller than 1, then we need T > S(S − 1)/2 so

    $M > 2 \log_2 S$.   (12.7)

We need twice as many bits as the number of bits, $\log_2 S$, that would be
sufficient to give each entry a unique name.

If we are happy to have occasional collisions, involving a fraction f of the
names S, then we need T > S/f (since the probability that one particular
name is collided-with is $f \simeq S/T$) so

    $M > \log_2 S + \log_2 [1/f]$,   (12.8)

which means for $f \simeq 0.01$ that we need an extra 7 bits above $\log_2 S$.

The important point to note is the scaling of T with S in the two cases
(12.7, 12.8). If we want the hash function to be collision-free, then we must
have T greater than $\sim S^2$. If we are happy to have a small frequency of
collisions, then T needs to be of order S only.
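As a worked instance of (12.7) and (12.8), take $S = 10^6$ entries (the round figure used as an example earlier in the chapter) and $f = 0.01$; the rounding up to whole bits is an assumption of this illustration.

```latex
\begin{align*}
\log_2 S &= \log_2 10^{6} \simeq 19.9, \\
\text{collision-free:} \quad M &> 2 \log_2 S \simeq 39.9
    \;\Rightarrow\; M = 40 \ \text{bits}, \\
\text{1\% collisions:} \quad M &> \log_2 S + \log_2 (1/f) \simeq 19.9 + 6.6 \simeq 26.6
    \;\Rightarrow\; M = 27 \ \text{bits}.
\end{align*}
```

With M = 40 the expected number of colliding pairs from (12.6) is about $10^{12}/(2 \times 2^{40}) \simeq 0.45$, which is indeed smaller than 1.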


Solution to exercise 12.5 (p.198). The posterior probability ratio for the two
hypotheses, $H_+$ = ‘calculation correct’ and $H_-$ = ‘calculation incorrect’, is
the product of the prior probability ratio $P(H_+)/P(H_-)$ and the likelihood
ratio, $P(\text{match} \mid H_+)/P(\text{match} \mid H_-)$. This second factor is the answer to the
question. The numerator $P(\text{match} \mid H_+)$ is equal to 1. The denominator’s
value depends on our model of errors. If we know that the human calculator is
prone to errors involving multiplication of the answer by 10, or to transposition
of adjacent digits, neither of which affects the hash value, then $P(\text{match} \mid H_-)$
could be equal to 1 also, so that the correct match gives no evidence in favour
of $H_+$. But if we assume that errors are ‘random from the point of view of the
hash function’ then the probability of a false positive is $P(\text{match} \mid H_-) = 1/9$,
and the correct match gives evidence 9:1 in favour of $H_+$.


Solution to exercise 12.7 (p.199). If you add a tiny M = 32 extra bits of hash
to a huge N-bit file you get pretty good error detection: the probability that
an error is undetected is $2^{-M}$, less than one in a billion. To do error correction
requires far more check bits, the number depending on the expected types of
corruption, and on the file size. For example, if just eight random bits in a
megabyte file are corrupted, it would take about $\log_2 \binom{2^{23}}{8} \simeq 23 \times 8 \simeq 180$
bits to specify which are the corrupted bits, and the number of parity-check
bits used by a successful error-correcting code would have to be at least this
number, by the counting argument of exercise 1.10 (solution, p.20).


Solution to exercise 12.10 (p.201). We want to know the length L of a string
such that it is very improbable that that string matches any part of the entire
writings of humanity. Let’s estimate that these writings total about one book
for each person living, and that each book contains two million characters (200
pages with 10000 characters per page); that’s $10^{16}$ characters, drawn from
an alphabet of, say, 37 characters.

The probability that a randomly chosen string of length L matches at one
point in the collected works of humanity is $1/37^L$. So the expected number
of matches is $10^{16}/37^L$, which is vanishingly small if $L > 16/\log_{10} 37 \simeq 10$.
Because of the redundancy and repetition of humanity’s writings, it is possible
that $L \simeq 10$ is an overestimate.

So, if you want to write something unique, sit down and compose a string 
of ten characters. But don’t write gidnebinzz, because I already thought of 
that string. 

As for a new melody, if we focus on the sequence of notes, ignoring duration 
and stress, and allow leaps of up to an octave at each note, then the number 
of choices per note is 23. The pitch of the first note is arbitrary. The number 
of melodies of length r notes in this rather ugly ensemble of Schönbergian
tunes is $23^{r-1}$; for example, there are 250000 of length r = 5. Restricting
the permitted intervals will reduce this figure; including duration and stress
will increase it again. [If we restrict the permitted intervals to repetitions and
tones or semitones, the reduction is particularly severe; is this why the melody
of ‘Ode to Joy’ sounds so boring?] The number of recorded compositions is
probably less than a million. If you learn 100 new melodies per week for every 
week of your life then you will have learned 250000 melodies at age 50. Based 
on empirical experience of playing the game ‘guess that tune’, it seems to
me that whereas many four-note sequences are shared in common between
melodies, the number of collisions between five-note sequences is rather smaller:
most famous five-note sequences are unique.

In guess that tune, one player chooses a melody, and sings a
gradually-increasing number of its notes, while the other participants
try to guess the whole melody.

The Parsons code is a related hash function for melodies: each pair of
consecutive notes is coded as U (‘up’) if the second note is higher
than the first, R (‘repeat’) if the pitches are equal, and D (‘down’)
otherwise. You can find out how well this hash function works at
http://musipedia.org/.


Solution to exercise 12.11 (p.201). (a) Let the DNA-binding protein recognize
a sequence of length L nucleotides. That is, it binds preferentially to that
DNA sequence, and not to any other pieces of DNA in the whole genome. (In
reality, the recognized sequence may contain some wildcard characters, e.g.,
the * in TATAA*A, which denotes ‘any of A, C, G and T’; so, to be precise, we are
assuming that the recognized sequence contains L non-wildcard characters.)

Assuming the rest of the genome is ‘random’, i.e., that the sequence consists
of random nucleotides A, C, G and T with equal probability (which is
obviously untrue, but it shouldn’t make too much difference to our calculation),
the chance that there is no other occurrence of the target sequence in the
whole genome, of length N nucleotides, is roughly


    $(1 - (1/4)^L)^N \simeq \exp(-N (1/4)^L)$,   (12.9)

which is close to one only if

    $N 4^{-L} \ll 1$,   (12.10)

that is,

    $L > \log N / \log 4$.   (12.11)

Using $N = 3 \times 10^9$, we require the recognized sequence to be longer than
$L_{\min} = 16$ nucleotides.
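The threshold in (12.11) can be found without logarithms by multiplying up powers of the alphabet size until they exceed the genome length. The function name `min_site_length` is hypothetical, and the sketch assumes the powers stay within the range of `unsigned long long`.

```c
#include <assert.h>

/* Smallest integer L such that alphabet^L exceeds genome_length, that is,
   the smallest integer L greater than log(genome_length) / log(alphabet). */
unsigned int min_site_length(unsigned long long genome_length,
                             unsigned int alphabet)
{
    unsigned long long combos = 1;   /* alphabet^length */
    unsigned int length = 0;
    assert(alphabet >= 2);
    while (combos <= genome_length) {
        combos *= alphabet;
        length++;
    }
    return length;
}
```

For a genome of 3000000000 nucleotides and a four-letter alphabet this returns 16, since 4 to the power 15 is about 1.07 billion and 4 to the power 16 is about 4.29 billion. At two bits per nucleotide, sixteen nucleotides carry the 32 bits used in the protein-size argument.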
What size of protein does this imply? 


• A weak lower bound can be obtained by assuming that the information
content of the protein sequence itself is greater than the information
content of the nucleotide sequence the protein prefers to bind to (which
we have argued above must be at least 32 bits). This gives a minimum
protein length of $32/\log_2(20) \simeq 7$ amino acids.


• Thinking realistically, the recognition of the DNA sequence by the protein
presumably involves the protein coming into contact with all sixteen
nucleotides in the target sequence. If the protein is a monomer, it must
be big enough that it can simultaneously make contact with sixteen nucleotides
of DNA. One helical turn of DNA containing ten nucleotides
has a length of 3.4 nm, so a contiguous sequence of sixteen nucleotides
has a length of 5.4 nm. The diameter of the protein must therefore be
about 5.4 nm or greater. Egg-white lysozyme is a small globular protein
with a length of 129 amino acids and a diameter of about 4 nm. Assuming
that volume is proportional to sequence length and that volume
scales as the cube of the diameter, a protein of diameter 5.4 nm must
have a sequence of length $2.5 \times 129 \simeq 324$ amino acids.


(b) If, however, a target sequence consists of a twice-repeated sub-sequence, we 
can get by with a much smaller protein that recognizes only the sub-sequence, 
and that binds to the DNA strongly only if it can form a dimer, both halves 
of which are bound to the recognized sequence. Halving the diameter of the 
protein, we now only need a protein whose length is greater than 324/8 = 40 
amino acids. A protein of length smaller than this cannot by itself serve as 
a regulatory protein specific to one gene, because it’s simply too small to be 
able to make a sufficiently specific match — its available surface does not have 
enough information content. 
About Chapter 13


In Chapters 8-11, we established Shannon’s noisy-channel coding theorem 
for a general channel with any input and output alphabets. A great deal of 
attention in coding theory focuses on the special case of channels with binary 
inputs. Constraining ourselves to these channels simplifies matters, and leads 
us into an exceptionally rich world, which we will only taste in this book. 

One of the aims of this chapter is to point out a contrast between Shannon’s 
aim of achieving reliable communication over a noisy channel and the apparent 
aim of many in the world of coding theory. Many coding theorists take as 
their fundamental problem the task of packing as many spheres as possible, 
with radius as large as possible, into an N-dimensional space, with no spheres 
overlapping. Prizes are awarded to people who find packings that squeeze in an 
extra few spheres. While this is a fascinating mathematical topic, we shall see 
that the aim of maximizing the distance between codewords in a code has only 
a tenuous relationship to Shannon’s aim of reliable communication. 
Binary Codes


We’ve established Shannon’s noisy-channel coding theorem for a general chan- 
nel with any input and output alphabets. A great deal of attention in coding 
theory focuses on the special case of channels with binary inputs, the first 
implicit choice being the binary symmetric channel. 

The optimal decoder for a code, given a binary symmetric channel, finds
the codeword that is closest to the received vector, closest in Hamming distance.
The Hamming distance between two binary vectors is the number of
coordinates in which the two vectors differ. Decoding errors will occur if the
noise takes us from the transmitted codeword t to a received vector r that
is closer to some other codeword. The distances between codewords are thus
relevant to the probability of a decoding error.

Example: The Hamming distance between 00001111 and 11001101 is 3.
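The definition of Hamming distance translates directly into code. In this minimal sketch the vectors are assumed to be given as equal-length strings of the characters '0' and '1'; the function name `hamming_distance` is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Number of coordinates in which two equal-length binary strings differ. */
int hamming_distance(const char *u, const char *v)
{
    int distance = 0;
    assert(strlen(u) == strlen(v));
    for (; *u != '\0'; u++, v++)
        if (*u != *v)
            distance++;
    return distance;
}
```

For example, `hamming_distance("1011101", "1001001")` is 2, because the two vectors differ in the third and fifth coordinates only.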


> 13.1 Distance properties of a code


The distance of a code is the smallest separation between two of its codewords. 


Example 13.1. The (7,4) Hamming code (p.8) has distance d = 3. All pairs of
its codewords differ in at least 3 bits. The maximum number of errors
it can correct is t = 1; in general a code with distance d is
$\lfloor (d-1)/2 \rfloor$-error-correcting.


A more precise term for distance is the minimum distance of the code. The
distance of a code is often denoted by d or $d_{\min}$.

We’ll now constrain our attention to linear codes. In a linear code, all
codewords have identical distance properties, so we can summarize all the
distances between the code’s codewords by counting the distances from the
all-zero codeword.

The weight enumerator function of a code, A(w), is defined to be the 
number of codewords in the code that have weight w. The weight enumerator 
function is also known as the distance distribution of the code. 
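For a small code the weight enumerator function can be found by listing every codeword and counting its 1s. The sketch below does this for a (7,4) Hamming code. The parity rules it uses (t5 = s1+s2+s3, t6 = s2+s3+s4, t7 = s1+s3+s4, modulo 2) are an assumption of this illustration, being one common way of defining such a code, and the function names are hypothetical.

```c
#include <assert.h>

#define N_BITS 7
#define K_BITS 4

/* Encode 4 source bits (s1 is the most significant bit of source) into the
   7-bit codeword s1 s2 s3 s4 t5 t6 t7, using the assumed parity rules. */
unsigned int hamming74_encode(unsigned int source)
{
    unsigned int s1 = (source >> 3) & 1u, s2 = (source >> 2) & 1u;
    unsigned int s3 = (source >> 1) & 1u, s4 = source & 1u;
    unsigned int t5 = s1 ^ s2 ^ s3;
    unsigned int t6 = s2 ^ s3 ^ s4;
    unsigned int t7 = s1 ^ s3 ^ s4;
    assert(source < (1u << K_BITS));
    return (source << 3) | (t5 << 2) | (t6 << 1) | t7;
}

/* Fill A[w] with the number of codewords of weight w, for w = 0..7. */
void weight_enumerator(unsigned int A[N_BITS + 1])
{
    for (int w = 0; w <= N_BITS; w++)
        A[w] = 0;
    for (unsigned int source = 0; source < (1u << K_BITS); source++) {
        unsigned int codeword = hamming74_encode(source);
        int weight = 0;
        for (; codeword != 0; codeword >>= 1)
            weight += (int)(codeword & 1u);
        A[weight]++;
    }
}
```

The sixteen codewords split as A(0) = 1, A(3) = 7, A(4) = 7 and A(7) = 1, so the smallest non-zero weight, and hence the distance of the code, is 3.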
Example 13.2. The weight enumerator functions of the (7,4) Hamming code
and the dodecahedron code are shown in figures 13.1 and 13.2.

Figure 13.1. The graph of the (7,4) Hamming code, and its weight enumerator function.


> 13.2 Obsession with distance


Since the maximum number of errors that a code can guarantee to correct,
t, is related to its distance d by $t = \lfloor (d-1)/2 \rfloor$, many coding theorists focus
on the distance of a code, searching for codes of a given size that have the
biggest possible distance. Much of practical coding theory has focused on
decoders that give the optimal decoding for all error patterns of weight up to
the half-distance t of their codes. [d = 2t + 1 if d is odd, and
d = 2t + 2 if d is even.]
A bounded-distance decoder is a decoder that returns the closest codeword
to a received binary vector r if the distance from r to that codeword
is less than or equal to t; otherwise it returns a failure message.
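A brute-force version of this definition can be sketched as follows. It is an illustration only: the function names are hypothetical, codewords are assumed to be held in the low bits of an `unsigned int`, and a practical decoder would exploit the structure of the code rather than search the whole codebook.

```c
#include <assert.h>

/* Hamming distance between two words held in the low bits of unsigned ints. */
static int word_distance(unsigned int a, unsigned int b)
{
    int d = 0;
    for (unsigned int diff = a ^ b; diff != 0; diff >>= 1)
        d += (int)(diff & 1u);
    return d;
}

/* Returns the index of the closest codeword if it lies within distance t of
   the received word r, and -1 (a failure message) otherwise. */
int bounded_distance_decode(const unsigned int codewords[], int count,
                            unsigned int r, int t)
{
    int best = -1, best_distance = t + 1;
    assert(count > 0 && t >= 0);
    for (int i = 0; i < count; i++) {
        int d = word_distance(codewords[i], r);
        if (d < best_distance) {
            best = i;
            best_distance = d;
        }
    }
    return best;
}
```

As a usage example, take the four-codeword linear code {00000, 11100, 00111, 11011}, which has distance 3 and so t = 1. The received word 11000 is at distance 1 from 11100 and decodes to index 1, whereas 10001 is at distance 2 or more from every codeword and yields the failure value -1.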


The rationale for not trying to decode when more than t errors have occurred 
might be ‘we can’t guarantee that we can correct more than t errors, so we 
won't bother trying — who would be interested in a decoder that corrects some 
error patterns of weight greater than t, but not others?’ This defeatist attitude 
is an example of worst-case-ism, a widespread mental ailment which this book 
is intended to cure. 

The fact is that bounded-distance decoders cannot reach the Shannon limit 
of the binary symmetric channel; only a decoder that often corrects more than 
t errors can do this. State-of-the-art error-correcting codes have decoders
that work way beyond the minimum distance of the code.


Definitions of good and bad distance properties 


Given a family of codes of increasing blocklength N, and with rates approach- 
ing a limit R > 0, we may be able to put that family in one of the following 
categories, which have some similarities to the categories of ‘good’ and ‘bad’ 
codes defined earlier (p.183): 


A sequence of codes has ‘good’ distance if d/N tends to a constant 
greater than zero. 


A sequence of codes has ‘bad’ distance if d/N tends to zero. 


A sequence of codes has ‘very bad’ distance if d tends to a constant. 


Example 13.3. A low-density generator-matrix code is a linear code whose K × N
generator matrix G has a small number $d_0$ of 1s per row, regardless
of how big N is. The minimum distance of such a code is at most $d_0$, so
low-density generator-matrix codes have ‘very bad’ distance.


While having large distance is no bad thing, we’ll see, later on, why an
emphasis on distance can be unhealthy.

Figure 13.2. The graph defining the (30,11) dodecahedron code (the circles are
the 30 transmitted bits and the triangles are the 20 parity checks, one of which
is redundant) and the weight enumerator function (solid lines). The dotted lines
show the average weight enumerator function of all random linear codes with the
same size of generator matrix, which will be computed shortly. The lower figure
shows the same functions on a log scale.

Figure 13.3. The graph of a rate-1/2 low-density generator-matrix code. The
rightmost M of the transmitted bits are each connected to a single distinct
parity constraint. The leftmost K transmitted bits are each connected to a small
number of parity constraints.

Figure 13.4. Schematic picture of part of Hamming space perfectly filled by
t-spheres centred on the codewords of a perfect code.





> 13.3 Perfect codes 


A t-sphere (or a sphere of radius t) in Hamming space, centred on a point x, 
is the set of points whose Hamming distance from x is less than or equal to t. 

The (7,4) Hamming code has the beautiful property that if we place 1- 
spheres about each of its 16 codewords, those spheres perfectly fill Hamming 
space without overlapping. As we saw in Chapter 1, every binary vector of 
length 7 is within a distance of t = 1 of exactly one codeword of the Hamming 
code. 


A code is a perfect t-error-correcting code if the set of t-spheres cen- 
tred on the codewords of the code fill the Hamming space without over- 
lapping. (See figure 13.4.) 


Let’s recap our cast of characters. The number of codewords is $S = 2^K$.
The number of points in the entire Hamming space is $2^N$. The number of
points in a Hamming sphere of radius t is

$$\sum_{w=0}^{t} \binom{N}{w}. \tag{13.1}$$


For a code to be perfect with these parameters, we require S times the number
of points in the t-sphere to equal $2^N$:

$$\text{for a perfect code,} \quad 2^K \sum_{w=0}^{t} \binom{N}{w} = 2^N \tag{13.2}$$

$$\text{or, equivalently,} \quad \sum_{w=0}^{t} \binom{N}{w} = 2^{N-K}. \tag{13.3}$$


For a perfect code, the number of noise vectors in one sphere must equal 
the number of possible syndromes. The (7,4) Hamming code satisfies this 
numerological condition because 


$$1 + \binom{7}{1} = 2^3. \tag{13.4}$$

Figure 13.5. Schematic picture of Hamming space not perfectly filled by
t-spheres centred on the codewords of a code. The grey regions show points that
are at a Hamming distance of more than t from any codeword. This is a misleading
picture, as, for any code with large t in high dimensions, the grey space
between the spheres takes up almost all of Hamming space.
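
The counting condition (13.2) is easy to test numerically. The sketch below uses hypothetical function names and checks only this numerological condition; passing it is necessary for a perfect code with parameters (N, K, t) to exist, but it does not by itself show that such a code exists.

```python
from math import comb


def sphere_volume(N, t):
    """Number of points within Hamming distance t of a point, equation (13.1)."""
    return sum(comb(N, w) for w in range(t + 1))


def satisfies_sphere_packing_equality(N, K, t):
    """True if 2^K spheres of radius t have total volume exactly 2^N (13.2)."""
    return 2**K * sphere_volume(N, t) == 2**N


# Hamming-code parameters, a repetition code, and two parameter sets that fail.
for N, K, t in [(7, 4, 1), (15, 11, 1), (5, 1, 2), (7, 4, 2), (30, 11, 1)]:
    print((N, K, t), satisfies_sphere_packing_equality(N, K, t))
```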





How happy we would be to use perfect codes 


If there were large numbers of perfect codes to choose from, with a wide
range of blocklengths and rates, then these would be the perfect solution to
Shannon’s problem. We could communicate over a binary symmetric channel
with noise level f, for example, by picking a perfect t-error-correcting code
with blocklength N and $t = f^* N$, where $f^* = f + \delta$ and N and $\delta$
are chosen such that the probability that the noise flips more than t bits is
satisfactorily small.

However, there are almost no perfect codes. The only nontrivial perfect
binary codes are


1. the Hamming codes, which are perfect codes with t = 1 and blocklength
$N = 2^M - 1$, defined below; the rate of a Hamming code approaches 1
as its blocklength N increases;


2. the repetition codes of odd blocklength N, which are perfect codes with 
t = (N — 1)/2; the rate of repetition codes goes to zero as 1/N; and 


3. one remarkable 3-error-correcting code with $2^{12}$ codewords of
blocklength N = 23 known as the binary Golay code. [A second 2-error-correcting
Golay code of length N = 11 over a ternary alphabet was discovered by a Finnish
football-pool enthusiast called Juhani Virtakallio in 1947.]


There are no other binary perfect codes. Why this shortage of perfect codes? 
Is it because precise numerological coincidences like those satisfied by the 
parameters of the Hamming code (13.4) and the Golay code, 


$$1 + \binom{23}{1} + \binom{23}{2} + \binom{23}{3} = 2^{11}, \tag{13.5}$$


are rare? Are there plenty of ‘almost-perfect’ codes for which the t-spheres fill 
almost the whole space? 





No. In fact, the picture of Hamming spheres centred on the codewords 
almost filling Hamming space (figure 13.5) is a misleading one: for most codes, 
whether they are good codes or bad codes, almost all the Hamming space is 
taken up by the space between t-spheres (which is shown in grey in figure 13.5). 


Having established this gloomy picture, we spend a moment filling in the 
properties of the perfect codes mentioned above. 




Figure 13.6. Three codewords.

00000000000000000000000000000
11111111111111111110000000000
00000011111111111111111110000

The Hamming codes


The (7,4) Hamming code can be defined as the linear code whose 3 × 7
parity-check matrix contains, as its columns, all the 7 (= $2^3 - 1$) non-zero
vectors of length 3. Since these 7 vectors are all different, any single
bit-flip produces a distinct syndrome, so all single-bit errors can be detected
and corrected.

We can generalize this code, with M = 3 parity constraints, as follows. The
Hamming codes are single-error-correcting codes defined by picking a number
of parity-check constraints, M; the blocklength N is $N = 2^M - 1$; the
parity-check matrix contains, as its columns, all the N non-zero vectors of
length M bits.

The first few Hamming codes have the following rates: 


Checks, M    (N, K)      R = K/N

2            (3, 1)      1/3       repetition code $R_3$
3            (7, 4)      4/7       (7,4) Hamming code
4            (15, 11)    11/15
5            (31, 26)    26/31
6            (63, 57)    57/63
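
The construction of the Hamming codes can be sketched in a few lines. The column ordering used here (column n is the binary representation of n) is an assumption chosen so that the syndrome directly names the flipped bit; it differs from the systematic ordering used elsewhere in this book, and the function names are hypothetical.

```python
def hamming_parameters(M):
    """Blocklength N and number of source bits K for M parity checks."""
    N = 2**M - 1
    return N, N - M


def hamming_parity_check(M):
    """M x N matrix whose columns are all non-zero binary vectors of length M.

    Column n (counting from 1) is n written in binary, most significant bit
    in the first row.
    """
    N = 2**M - 1
    return [[(n >> (M - 1 - m)) & 1 for n in range(1, N + 1)] for m in range(M)]


def syndrome(H, r):
    """Syndrome z = H r (mod 2) of a received vector r."""
    return [sum(h * x for h, x in zip(row, r)) % 2 for row in H]


def correct_single_error(H, r):
    """Flip the bit whose column of H equals the syndrome, if it is non-zero."""
    z = syndrome(H, r)
    corrected = list(r)
    if any(z):
        # With this column ordering the syndrome, read as a binary number,
        # is the (1-based) position of the flipped bit.
        position = int("".join(str(bit) for bit in z), 2) - 1
        corrected[position] ^= 1
    return corrected


for M in range(2, 7):
    N, K = hamming_parameters(M)
    print(M, (N, K), K / N)
```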


Exercise 13.4.[p.223] What is the probability of block error of the (N, K)
Hamming code to leading order, when the code is used for a binary
symmetric channel with noise density f?


> 13.4 Perfectness is unattainable — first proof 


We will show in several ways that useful perfect codes do not exist (here, 
‘useful’ means ‘having large blocklength N, and rate close neither to 0 nor 1’). 

Shannon proved that, given a binary symmetric channel with any noise
level f, there exist codes with large blocklength N and rate as close as you
like to $C(f) = 1 - H_2(f)$ that enable communication with arbitrarily small
error probability. For large N, the number of errors per block will typically be
about fN, so these codes of Shannon are ‘almost-certainly-fN-error-correcting’
codes.

Let’s pick the special case of a noisy channel with $f \in (1/3, 1/2)$. Can
we find a large perfect code that is fN-error-correcting? Well, let’s suppose
that such a code has been found, and examine just three of its codewords.
(Remember that the code ought to have rate $R \simeq 1 - H_2(f)$, so it should
have an enormous number ($2^{NR}$) of codewords.) Without loss of generality, we
choose one of the codewords to be the all-zero codeword and define the other
two to have overlaps with it as shown in figure 13.6. The second codeword
differs from the first in a fraction u + v of its coordinates. The third codeword
differs from the first in a fraction v + w, and from the second in a fraction
u + w. A fraction x of the coordinates have value zero in all three codewords,
so u + v + w + x = 1.
Now, if the code is fN-error-correcting, its minimum distance must be greater
than 2fN, so

$$u + v > 2f, \quad v + w > 2f, \quad \text{and} \quad u + w > 2f. \tag{13.6}$$

Summing these three inequalities and dividing by two, we have

$$u + v + w > 3f. \tag{13.7}$$

So if f > 1/3, we can deduce u + v + w > 1, so that x < 0, which is impossible.
Such a code cannot exist. So the code cannot have three codewords, let alone
$2^{NR}$.

We conclude that, whereas Shannon proved there are plenty of codes for 
communicating over a binary symmetric channel with f > 1/3, there are no 
perfect codes that can do this. 

We now study a more general argument that indicates that there are no 
large perfect linear codes for general rates (other than 0 and 1). We do this 
by finding the typical distance of a random linear code. 


13.5 Weight enumerator function of random linear codes 


Imagine making a code by picking the binary entries in the M × N parity-check
matrix H at random. What weight enumerator function should we expect?
The weight enumerator of one particular code with parity-check matrix H,
$A(w)_H$, is the number of codewords of weight w, which can be written

$$A(w)_H = \sum_{x : |x| = w} 1[Hx = 0], \tag{13.8}$$

where the sum is over all vectors x whose weight is w and the truth function
1[Hx = 0] equals one if Hx = 0 and zero otherwise.
We can find the expected value of A(w),

$$\langle A(w) \rangle = \sum_{H} P(H) \, A(w)_H \tag{13.9}$$

$$= \sum_{x : |x| = w} \sum_{H} P(H) \, 1[Hx = 0], \tag{13.10}$$


by evaluating the probability that a particular word of weight w > 0 is a 
codeword of the code (averaging over all binary linear codes in our ensemble). 
By symmetry, this probability depends only on the weight w of the word, not 
on the details of the word. The probability that the entire syndrome Hx is 
zero can be found by multiplying together the probabilities that each of the 
M bits in the syndrome is zero. Each bit $z_m$ of the syndrome is a sum (mod
2) of w random bits, so the probability that $z_m = 0$ is 1/2. The probability
that Hx = 0 is thus

$$\sum_{H} P(H) \, 1[Hx = 0] = (1/2)^M = 2^{-M}, \tag{13.11}$$

independent of w.
The expected number of words of weight w (13.10) is given by summing,
over all words of weight w, the probability that each word is a codeword. The
number of words of weight w is $\binom{N}{w}$, so

$$\langle A(w) \rangle = \binom{N}{w} 2^{-M} \quad \text{for any } w > 0. \tag{13.12}$$

101010100100110100010110
001110111100011001101000
101110111001011000110100
000010111100101101001000
000000110011110100000100
110010001111100000101110
101111100010100001001110
110010110001101011101010
100011100101000010111101
010001000010101001101010
010111110111111110111010
101110101001001101000011

Figure 13.7. A random binary parity-check matrix.

For large N, we can use $\log_2 \binom{N}{w} \simeq N H_2(w/N)$ and
$R \simeq 1 - M/N$ to write

$$\log_2 \langle A(w) \rangle \simeq N H_2(w/N) - M \tag{13.13}$$

$$\simeq N [H_2(w/N) - (1 - R)] \quad \text{for any } w > 0. \tag{13.14}$$
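
The result (13.12) can be checked exactly for a tiny ensemble by enumerating every parity-check matrix. This is a brute-force sketch under the assumption that all M × N binary matrices are equally probable; the function name is hypothetical, and the enumeration is only feasible for very small M and N.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb


def average_weight_enumerator(M, N):
    """Average A(w) over all 2^(M*N) binary M x N parity-check matrices."""
    totals = [0] * (N + 1)
    matrices = 0
    for bits in product((0, 1), repeat=M * N):
        H = [bits[m * N:(m + 1) * N] for m in range(M)]
        matrices += 1
        for w in range(N + 1):
            for support in combinations(range(N), w):
                # x has 1s exactly on `support`; Hx = 0 (mod 2) if and only if
                # every row of H overlaps the support an even number of times.
                if all(sum(row[n] for n in support) % 2 == 0 for row in H):
                    totals[w] += 1
    return [Fraction(total, matrices) for total in totals]


M, N = 2, 4
average = average_weight_enumerator(M, N)
for w in range(1, N + 1):
    print(w, average[w], Fraction(comb(N, w), 2**M))
```

The all-zero word (w = 0) is a codeword of every linear code, which is why (13.12) is stated only for w > 0.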


As a concrete example, figure 13.8 shows the expected weight enumerator 
function of a rate-1/3 random linear code with N = 540 and M = 360. 


Gilbert-Varshamov distance


For weights w such that $H_2(w/N) < (1 - R)$, the expectation of A(w) is
smaller than 1; for weights such that $H_2(w/N) > (1 - R)$, the expectation is
greater than 1. We thus expect, for large N, that the minimum distance of a
random linear code will be close to the distance $d_{GV}$ defined by

$$H_2(d_{GV}/N) = (1 - R). \tag{13.15}$$

Definition. This distance, $d_{GV} \equiv N H_2^{-1}(1 - R)$, is the
Gilbert-Varshamov distance for rate R and blocklength N.

The Gilbert-Varshamov conjecture, widely believed, asserts that (for large
N) it is not possible to create binary codes with minimum distance significantly
greater than $d_{GV}$.


Definition. The Gilbert-Varshamov rate $R_{GV}$ is the maximum rate at which
you can reliably communicate with a bounded-distance decoder (as defined on
p.207), assuming that the Gilbert-Varshamov conjecture is true.
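
The Gilbert-Varshamov distance has no closed form, but it is easily found numerically. The sketch below inverts the binary entropy function by bisection on the interval [0, 1/2]; the function names and the tolerance are assumptions of this example.

```python
from math import log2


def H2(p):
    """Binary entropy function, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def H2_inverse(h, tol=1e-12):
    """Solve H2(p) = h for p in [0, 1/2], where H2 is increasing."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H2(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def gilbert_varshamov_distance(N, R):
    """d_GV = N * H2^{-1}(1 - R), as in the definition following (13.15)."""
    return N * H2_inverse(1 - R)


# The rate-1/3 example with N = 540 used for the expected weight enumerator.
print(gilbert_varshamov_distance(540, 1 / 3))
```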


Why sphere-packing is a bad perspective, and an obsession with distance 
is inappropriate 


If one uses a bounded-distance decoder, the maximum tolerable noise level
will flip a fraction $f_{bd} = \frac{1}{2} d_{min}/N$ of the bits. So, assuming
$d_{min}$ is equal to the Gilbert distance $d_{GV}$ (13.15), we have:

$$H_2(2 f_{bd}) = (1 - R_{GV}). \tag{13.16}$$

$$R_{GV} = 1 - H_2(2 f_{bd}). \tag{13.17}$$

Now, here’s the crunch: what did Shannon say is achievable? He said the
maximum possible rate of communication is the capacity,

$$C = 1 - H_2(f). \tag{13.18}$$

So for a given rate R, the maximum tolerable noise level, according to Shannon,
is given by

$$H_2(f) = (1 - R). \tag{13.19}$$

Our conclusion: imagine a good code of rate R has been chosen; equations
(13.16) and (13.19) respectively define the maximum noise levels tolerable by
a bounded-distance decoder, $f_{bd}$, and by Shannon’s decoder, f.

$$f_{bd} = f/2. \tag{13.20}$$


Bounded-distance decoders can only ever cope with half the noise-level that 
Shannon proved is tolerable! 
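
A few numbers make the factor of two concrete. This sketch simply evaluates the capacity (13.18) and the Gilbert-Varshamov rate (13.17) at some illustrative noise levels; the function names and the chosen noise levels are assumptions of the example.

```python
from math import log2


def H2(p):
    """Binary entropy function, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def capacity(f):
    """Capacity of the binary symmetric channel, equation (13.18)."""
    return 1 - H2(f)


def gilbert_varshamov_rate(f_bd):
    """Rate (13.17) for a bounded-distance decoder; meaningful for f_bd <= 1/4."""
    return 1 - H2(2 * f_bd)


for f in (0.01, 0.05, 0.10):
    print(f, capacity(f), gilbert_varshamov_rate(f))
```

At every noise level the bounded-distance rate equals the capacity of a channel with twice the noise, which is just (13.20) read the other way round.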

How does this relate to perfect codes? A code is perfect if there are
t-spheres around its codewords that fill Hamming space without overlapping.
But when a typical random linear code is used to communicate over a binary
symmetric channel near to the Shannon limit, the typical number of bits
flipped is fN, and the minimum distance between codewords is also fN, or
a little bigger, if we are a little below the Shannon limit. So the fN-spheres
around the codewords overlap with each other sufficiently that each sphere
almost contains the centre of its nearest neighbour! The reason why this
overlap is not disastrous is that, in high dimensions, the volume associated
with the overlap, shown shaded in figure 13.10, is a tiny fraction of either
sphere, so the probability of landing in it is extremely small.

Figure 13.8. The expected weight enumerator function ⟨A(w)⟩ of a random linear
code with N = 540 and M = 360. Lower figure shows ⟨A(w)⟩ on a logarithmic
scale.

Figure 13.9. Contrast between Shannon’s channel capacity C and the Gilbert rate
$R_{GV}$ (the maximum communication rate achievable using a bounded-distance
decoder), as a function of noise level f. For any given rate, R, the maximum
tolerable noise level for Shannon is twice as big as the maximum tolerable noise
level for a ‘worst-case-ist’ who uses a bounded-distance decoder.

The moral of the story is that worst-case-ism can be bad for you, halving 
your ability to tolerate noise. You have to be able to decode way beyond the 
minimum distance of a code to get to the Shannon limit! 

Nevertheless, the minimum distance of a code is of interest in practice, 
because, under some conditions, the minimum distance dominates the errors 
made by a code. 


13.6 Berlekamp’s bats 


A blind bat lives in a cave. It flies about the centre of the cave, which corre- 
sponds to one codeword, with its typical distance from the centre controlled 
by a friskiness parameter f. (The displacement of the bat from the centre 
corresponds to the noise vector.) The boundaries of the cave are made up of 
stalactites that point in towards the centre of the cave (figure 13.11). Each 
stalactite is analogous to the boundary between the home codeword and an- 
other codeword. The stalactite is like the shaded region in figure 13.10, but 
reshaped to convey the idea that it is a region of very small volume. 

Decoding errors correspond to the bat’s intended trajectory passing inside 
a stalactite. Collisions with stalactites at various distances from the centre 
are possible. 

If the friskiness is very small, the bat is usually very close to the centre 
of the cave; collisions will be rare, and when they do occur, they will usually 
involve the stalactites whose tips are closest to the centre point. Similarly, 
under low-noise conditions, decoding errors will be rare, and they will
typically involve low-weight codewords. Under low-noise conditions, the minimum
distance of a code is relevant to the (very small) probability of error.

Figure 13.10. Two overlapping spheres whose radius is almost as big as the
distance between their centres.

Figure 13.11. Berlekamp’s schematic picture of Hamming space in the vicinity of
a codeword. The jagged solid line encloses all points to which this codeword is
the closest. The t-sphere around the codeword takes up a small fraction of this
space.

If the friskiness is higher, the bat may often make excursions beyond the
safe distance t where the longest stalactites start, but it will collide most fre- 
quently with more distant stalactites, owing to their greater number. There’s 
only a tiny number of stalactites at the minimum distance, so they are rela- 
tively unlikely to cause the errors. Similarly, errors in a real error-correcting 
code depend on the properties of the weight enumerator function. 

At very high friskiness, the bat is always a long way from the centre of 
the cave, and almost all its collisions involve contact with distant stalactites. 
Under these conditions, the bat’s collision frequency has nothing to do with 
the distance from the centre to the closest stalactite. 


13.7 Concatenation of Hamming codes 


It is instructive to play some more with the concatenation of Hamming codes, 
a concept we first visited in figure 11.6, because we will get insights into the 
notion of good codes and the relevance or otherwise of the minimum distance 
of a code. 

We can create a concatenated code for a binary symmetric channel with 
noise density f by encoding with several Hamming codes in succession. 

The table recaps the key properties of the Hamming codes, indexed by
number of constraints, M. All the Hamming codes have minimum distance
d = 3 and can correct one error in N.


$N = 2^M - 1$                            blocklength
$K = N - M$                              number of source bits
$p_B = \frac{3}{N} \binom{N}{2} f^2$     probability of block error to leading order


If we make a product code by concatenating a sequence of C Hamming
codes with increasing M, we can choose those parameters $\{M_c\}_{c=1}^{C}$ in
such a way that the rate of the product code

$$R_C = \prod_{c=1}^{C} \frac{N_c - M_c}{N_c} \tag{13.21}$$

tends to a non-zero limit as C increases. For example, if we set $M_1 = 2$,
$M_2 = 3$, $M_3 = 4$, etc., then the asymptotic rate is 0.093 (figure 13.12).

The blocklength N is a rapidly-growing function of C, so these codes are
somewhat impractical. A further weakness of these codes is that their minimum
distance is not very good (figure 13.13). Every one of the constituent
Hamming codes has minimum distance 3, so the minimum distance of the
Cth product is $3^C$. The blocklength N grows faster than $3^C$, so the ratio
d/N tends to zero as C increases. In contrast, for typical random codes, the
ratio d/N tends to a constant such that $H_2(d/N) = 1 - R$. Concatenated Hamming
codes thus have ‘bad’ distance.
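
The behaviour of the rate and of the ratio d/N can be tabulated directly. The sketch assumes the sequence M = 2, 3, 4, ... mentioned in the example, takes the blocklength of the product code to be the product of the constituent blocklengths, and uses a hypothetical function name.

```python
def concatenated_hamming(C):
    """Rate, blocklength and minimum distance of the product of the
    Hamming codes with M = 2, 3, ..., C + 1 parity checks."""
    rate, N, d = 1.0, 1, 1
    for M in range(2, C + 2):
        N_c = 2**M - 1
        rate *= (N_c - M) / N_c   # each stage multiplies the rate by K_c / N_c
        N *= N_c                  # blocklengths multiply
        d *= 3                    # each Hamming code has minimum distance 3
    return rate, N, d


for C in (1, 2, 3, 5, 10, 20):
    rate, N, d = concatenated_hamming(C)
    print(C, round(rate, 4), d / N)
```

The rate settles near 0.093 while d/N shrinks rapidly towards zero.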

Nevertheless, it turns out that this simple sequence of codes yields good 
codes for some channels — but not very good codes (see section 11.4 to recall 
the definitions of the terms ‘good’ and ‘very good’). Rather than prove this 
result, we will simply explore it numerically. 

Figure 13.14 shows the bit error probability $p_b$ of the concatenated codes
assuming that the constituent codes are decoded in sequence, as described
in section 11.4. [This one-code-at-a-time decoding is suboptimal, as we saw
there.] The horizontal axis shows the rates of the codes. As the number
of concatenations increases, the rate drops to 0.093 and the error probability
drops towards zero. The channel assumed in the figure is the binary symmetric
channel with f = 0.0588. This is the highest noise level that can be tolerated
using this concatenated code.

The take-home message from this story is distance isn’t everything. The 
minimum distance of a code, although widely worshipped by coding theorists, 
is not of fundamental importance to Shannon’s mission of achieving reliable 
communication over noisy channels. 


Exercise 13.5.[3] Prove that there exist families of codes with ‘bad’ distance
that are ‘very good’ codes.


13.8 Distance isn’t everything 


Let’s get a quantitative feeling for the effect of the minimum distance of a 
code, for the special case of a binary symmetric channel. 


The error probability associated with one low-weight codeword 


Let a binary code have blocklength N and just two codewords, which differ in 
d places. For simplicity, let’s assume d is even. What is the error probability 
if this code is used on a binary symmetric channel with noise level f? 

Bit flips matter only in places where the two codewords differ. The error 
probability is dominated by the probability that d/2 of these bits are flipped. 
What happens to the other bits is irrelevant, since the optimal decoder ignores 
them. 


$$P(\text{block error}) \simeq \binom{d}{d/2} f^{d/2} (1 - f)^{d/2}. \tag{13.22}$$

This error probability associated with a single codeword of weight d is plotted
in figure 13.15. Using the approximation for the binomial coefficient (1.16),
we can further approximate

$$P(\text{block error}) \simeq \frac{2^d}{\sqrt{\pi d/2}} f^{d/2} (1 - f)^{d/2} \tag{13.23}$$

$$= \frac{1}{\sqrt{\pi d/2}} [\beta(f)]^d, \tag{13.24}$$

where $\beta(f) = 2 f^{1/2} (1 - f)^{1/2}$ is called the Bhattacharyya parameter
of the channel.

Now, consider a general linear code with distance d. Its block error
probability must be at least $\binom{d}{d/2} f^{d/2} (1 - f)^{d/2}$, independent of the blocklength
N of the code. For this reason, a sequence of codes of increasing blocklength 
N and constant distance d (i.e., ‘very bad’ distance) cannot have a block er- 
ror probability that tends to zero, on any binary symmetric channel. If we 
are interested in making superb error-correcting codes with tiny, tiny error 
probability, we might therefore shun codes with bad distance. However, being 
pragmatic, we should look more carefully at figure 13.15. In Chapter 1 we
argued that codes for disk drives need an error probability smaller than about
$10^{-18}$. If the raw error probability in the disk drive is about 0.001, the
error probability associated with one codeword at distance d = 20 is smaller
than $10^{-24}$. If the raw error probability in the disk drive is about 0.01,
the error probability associated with one codeword at distance d = 30 is smaller
than $10^{-20}$. For practical purposes, therefore, it is not essential for a
code to have good distance. For example, codes of blocklength 10000, known to
have many codewords of weight 32, can nevertheless correct errors of weight 320
with tiny error probability.

Figure 13.14. The bit error probabilities versus the rates R of [...] for the
binary symmetric channel [...] Shannon limit for this channel.

I wouldn’t want you to think I am recommending the use of codes with
bad distance; in Chapter 47 we will discuss low-density parity-check codes, my
favourite codes, which have both excellent performance and good distance.
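
The disk-drive figures quoted in this section can be checked from the single-codeword estimate (13.22). The sketch also evaluates the Bhattacharyya form of the estimate, which replaces the central binomial coefficient by its large-d approximation; the function names are hypothetical.

```python
from math import comb, pi, sqrt


def single_codeword_error(d, f):
    """Estimate (13.22): d/2 of the d differing bits are flipped (d even)."""
    return comb(d, d // 2) * f ** (d // 2) * (1 - f) ** (d // 2)


def bhattacharyya_estimate(d, f):
    """The same estimate with the binomial coefficient approximated by
    2^d / sqrt(pi d / 2), written using beta(f) = 2 sqrt(f (1 - f))."""
    beta = 2 * sqrt(f * (1 - f))
    return beta**d / sqrt(pi * d / 2)


for f, d in [(0.001, 20), (0.01, 30)]:
    print(f, d, single_codeword_error(d, f), bhattacharyya_estimate(d, f))
```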


> 13.9 The union bound 


The error probability of a code on the binary symmetric channel can be 
bounded in terms of its weight enumerator function by adding up appropriate 
multiples of the error probability associated with a single codeword (13.24): 


$$P(\text{block error}) \leq \sum_{w > 0} A(w) [\beta(f)]^w. \tag{13.25}$$


This inequality, which is an example of a union bound, is accurate for low 
noise levels f, but inaccurate for high noise levels, because it overcounts the 
contribution of errors that cause confusion with more than one codeword at a 
time. 
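
As an illustration, the union bound can be evaluated for the (7,4) Hamming code, whose weight enumerator is found here by listing all sixteen codewords. The generator matrix is the one quoted for this code in section 13.10; the comparison value assumes a decoder that corrects every single-bit error and nothing else, and the function names are hypothetical.

```python
from itertools import product
from math import sqrt

# Generator matrix of the (7,4) Hamming code, rows are basis codewords.
G = [
    (1, 0, 0, 0, 1, 0, 1),
    (0, 1, 0, 0, 1, 1, 0),
    (0, 0, 1, 0, 1, 1, 1),
    (0, 0, 0, 1, 0, 1, 1),
]


def weight_enumerator(G):
    """A[w] = number of codewords of weight w, by listing all 2^K codewords."""
    K, N = len(G), len(G[0])
    A = [0] * (N + 1)
    for s in product((0, 1), repeat=K):
        codeword = [sum(s[k] * G[k][n] for k in range(K)) % 2 for n in range(N)]
        A[sum(codeword)] += 1
    return A


def union_bound(A, f):
    """Right-hand side of (13.25) with beta(f) = 2 sqrt(f (1 - f))."""
    beta = 2 * sqrt(f * (1 - f))
    return sum(A[w] * beta**w for w in range(1, len(A)))


def hamming74_block_error(f):
    """Block error probability when exactly the patterns of 0 or 1 flips
    are decoded correctly."""
    return 1 - (1 - f) ** 7 - 7 * f * (1 - f) ** 6


A = weight_enumerator(G)
print(A)
for f in (0.001, 0.01, 0.1):
    print(f, hamming74_block_error(f), union_bound(A, f))
```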


> Exercise 13.6.1] Poor man’s noisy-channel coding theorem. 


Pretending that the union bound (13.25) is accurate, and using the average
weight enumerator function of a random linear code (13.14) (section
13.5) as A(w), estimate the maximum rate $R_{UB}(f)$ at which one can
communicate over a binary symmetric channel.


Or, to look at it more positively, using the union bound (13.25) as an
inequality, show that communication at rates up to $R_{UB}(f)$ is possible
over the binary symmetric channel.


In the following chapter, by analysing the probability of error of syndrome 
decoding for a binary linear code, and using a union bound, we will prove 
Shannon’s noisy-channel coding theorem (for symmetric binary channels), and 
thus show that very good linear codes exist. 


> 13.10 Dual codes 


A concept that has some importance in coding theory, though we will have 
no immediate use for it in this book, is the idea of the dual of a linear error- 
correcting code. 

An (N, K) linear error-correcting code can be thought of as a set of $2^K$
codewords generated by adding together all combinations of K independent
basis codewords. The generator matrix of the code consists of those K basis
codewords, conventionally written as row vectors. For example, the (7,4)
Hamming code’s generator matrix (from p.10) is

$$G = \begin{bmatrix} 1&0&0&0&1&0&1 \\ 0&1&0&0&1&1&0 \\ 0&0&1&0&1&1&1 \\ 0&0&0&1&0&1&1 \end{bmatrix} \tag{13.26}$$

and its sixteen codewords were displayed in table 1.14 (p.9). The codewords
of this code are linear combinations of the four vectors [1 0 0 0 1 0 1],
[0 1 0 0 1 1 0], [0 0 1 0 1 1 1], and [0 0 0 1 0 1 1].

An (N, K) code may also be described in terms of an M × N parity-check
matrix (where M = N − K) as the set of vectors {t} that satisfy

$$Ht = 0. \tag{13.27}$$

One way of thinking of this equation is that each row of H specifies a vector
to which t must be orthogonal if it is a codeword.


The generator matrix specifies K vectors from which all codewords
can be built, and the parity-check matrix specifies a set of M vectors
to which all codewords are orthogonal.

The dual of a code is obtained by exchanging the generator matrix
and the parity-check matrix.


Definition. The set of all vectors of length N that are orthogonal to all
codewords in a code, C, is called the dual of the code, $C^\perp$.


If t is orthogonal to $h_1$ and $h_2$, then it is also orthogonal to
$h_3 \equiv h_1 + h_2$; so all codewords are orthogonal to any linear
combination of the M rows of H. So the set of all linear combinations of the
rows of the parity-check matrix is the dual code.

For our Hamming (7,4) code, the parity-check matrix is (from p.12):

$$H = [P \; I_3] = \begin{bmatrix} 1&1&1&0&1&0&0 \\ 0&1&1&1&0&1&0 \\ 1&0&1&1&0&0&1 \end{bmatrix}. \tag{13.28}$$

The dual of the (7,4) Hamming code $\mathcal{H}_{(7,4)}$ is the code shown in
table 13.16.


Table 13.16. The eight codewords of the dual of the (7,4) Hamming code.
[Compare with table 1.14, p.9.]

0000000   0101101   1001110   1100011
0010111   0111010   1011001   1110100


A possibly unexpected property of this pair of codes is that the dual,
$\mathcal{H}^\perp_{(7,4)}$, is contained within the code $\mathcal{H}_{(7,4)}$
itself: every word in the dual code is a codeword of the original (7,4) Hamming
code. This relationship can be written using set notation:

$$\mathcal{H}^\perp_{(7,4)} \subset \mathcal{H}_{(7,4)}. \tag{13.29}$$


The possibility that the set of dual vectors can overlap the set of codeword 
vectors is counterintuitive if we think of the vectors as real vectors — how can 
a vector be orthogonal to itself? But when we work in modulo-two arithmetic, 
many non-zero vectors are indeed orthogonal to themselves! 
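
The containment of the dual in the code can be confirmed by listing both sets of codewords. This sketch spans the rows of the generator and parity-check matrices of the (7,4) Hamming code in modulo-two arithmetic; the helper names `span` and `dot` are assumptions of the example.

```python
from itertools import product

G = [  # generator matrix of the (7,4) Hamming code
    (1, 0, 0, 0, 1, 0, 1),
    (0, 1, 0, 0, 1, 1, 0),
    (0, 0, 1, 0, 1, 1, 1),
    (0, 0, 0, 1, 0, 1, 1),
]
H = [  # parity-check matrix of the same code
    (1, 1, 1, 0, 1, 0, 0),
    (0, 1, 1, 1, 0, 1, 0),
    (1, 0, 1, 1, 0, 0, 1),
]


def span(rows):
    """All modulo-two linear combinations of the given rows."""
    n = len(rows[0])
    return {
        tuple(sum(c * row[i] for c, row in zip(coeffs, rows)) % 2 for i in range(n))
        for coeffs in product((0, 1), repeat=len(rows))
    }


def dot(u, v):
    """Inner product in modulo-two arithmetic."""
    return sum(a * b for a, b in zip(u, v)) % 2


code = span(G)   # sixteen codewords
dual = span(H)   # eight codewords of the dual

print(len(code), len(dual))
print(dual <= code)                                   # dual contained in code
print(all(dot(u, v) == 0 for u in dual for v in code))
print(all(dot(u, u) == 0 for u in dual))              # self-orthogonal vectors
```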


> Exercise 13.7.4 P223] Give a simple rule that distinguishes whether a binary 
vector is orthogonal to itself, as is each of the three vectors [1 1 1 0 1 0 0],
[0 1 1 1 0 1 0], and [1 0 1 1 0 0 1].


Some more duals 
In general, if a code has a systematic generator matrix,

$$G = [I_K | P^T], \tag{13.30}$$

where P is an M × K matrix, then its parity-check matrix is

$$H = [P | I_M]. \tag{13.31}$$

Example 13.8. The repetition code $R_3$ has generator matrix

$$G = \begin{bmatrix} 1 & 1 & 1 \end{bmatrix}; \tag{13.32}$$

its parity-check matrix is

$$H = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}. \tag{13.33}$$

The two codewords are [1 1 1] and [0 0 0].


The dual code has generator matrix

$$G^\perp = H = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix} \tag{13.34}$$

or equivalently, modifying $G^\perp$ into systematic form by row additions,

$$G^\perp = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix}. \tag{13.35}$$

We call this dual code the simple parity code $P_3$; it is the code with one
parity-check bit, which is equal to the sum of the two source bits. The
dual code’s four codewords are [1 1 0], [1 0 1], [0 0 0], and [0 1 1].


In this case, the only vector common to the code and the dual is the
all-zero codeword.


Goodness of duals 


If a sequence of codes is ‘good’, are their duals good too? Examples can be 
constructed of all cases: good codes with good duals (random linear codes); 
bad codes with bad duals; and good codes with bad duals. The last category 
is especially important: many state-of-the-art codes have the property that 
their duals are bad. The classic example is the low-density parity-check code, 
whose dual is a low-density generator-matrix code. 


> Exercise 13.9.19] Show that low-density generator-matrix codes are bad. A 
family of low-density generator-matrix codes is defined by two parameters
j, k, which are the column weight and row weight of all columns and rows
respectively of G. These weights are fixed, independent of N;
for example, (j,k) = (3,6). [Hint: show that the code has low-weight 
codewords, then use the argument from p.215.] 


Exercise 13.10.17] Show that low-density parity-check codes are good, and have 
good distance. (For solutions, see Gallager (1963) and MacKay (1999b).) 


Self-dual codes 


The (7,4) Hamming code had the property that the dual was contained in the 
code itself. A code is self-orthogonal if it is contained in its dual. For example, 
the dual of the (7,4) Hamming code is a self-orthogonal code. One way of 
seeing this is that the overlap between any pair of rows of H is even. Codes that 
contain their duals are important in quantum error-correction (Calderbank 
and Shor, 1996). 

It is intriguing, though not necessarily useful, to look at codes that are 
self-dual. A code C is self-dual if the dual of the code is identical to the code. 


$$C^\perp = C. \tag{13.36}$$


Some properties of self-dual codes can be deduced: 




1. If a code is self-dual, then its generator matrix is also a parity-check 
matrix for the code. 
2. Self-dual codes have rate 1/2, i.e., M = K = N/2. 


3. All codewords have even weight. 


> Exercise 13.11.[ P-223] What property must the matrix P satisfy, if the code 
with generator matrix $G = [I_K | P^T]$ is self-dual?


Examples of self-dual codes 


1. The repetition code $R_2$ is a simple example of a self-dual code.

$$G = H = \begin{bmatrix} 1 & 1 \end{bmatrix}. \tag{13.37}$$


2. The smallest non-trivial self-dual code is the following (8,4) code.

$$G = [I_4 | P^T] = \begin{bmatrix} 1&0&0&0&0&1&1&1 \\ 0&1&0&0&1&0&1&1 \\ 0&0&1&0&1&1&0&1 \\ 0&0&0&1&1&1&1&0 \end{bmatrix}. \tag{13.38}$$
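
Self-duality can be tested directly from the definition by brute force. The sketch below assumes the (8,4) generator matrix has rows [I_4 | P^T] with P^T rows 0111, 1011, 1101, 1110, and compares the code with the set of all vectors orthogonal to every codeword; the function names are hypothetical.

```python
from itertools import product

# Assumed generator matrix of the (8,4) code, rows [I_4 | P^T].
G84 = [
    (1, 0, 0, 0, 0, 1, 1, 1),
    (0, 1, 0, 0, 1, 0, 1, 1),
    (0, 0, 1, 0, 1, 1, 0, 1),
    (0, 0, 0, 1, 1, 1, 1, 0),
]


def span(rows):
    """All modulo-two linear combinations of the given rows."""
    n = len(rows[0])
    return {
        tuple(sum(c * row[i] for c, row in zip(coeffs, rows)) % 2 for i in range(n))
        for coeffs in product((0, 1), repeat=len(rows))
    }


def dual_code(code, N):
    """All length-N binary vectors orthogonal (mod 2) to every codeword."""
    return {
        x for x in product((0, 1), repeat=N)
        if all(sum(a * b for a, b in zip(x, c)) % 2 == 0 for c in code)
    }


def is_self_dual(G):
    """True if the code generated by G is identical to its dual."""
    code = span(G)
    return code == dual_code(code, len(G[0]))


print(is_self_dual(G84))                     # the (8,4) code
print(is_self_dual([(1, 1)]))                # the repetition code R2
print(sorted({sum(c) for c in span(G84)}))   # codeword weights, all even
```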


> Exercise 13.12.!? P-?23] Find the relationship of the above (8,4) code to the 
(7,4) Hamming code. 


Duals and graphs 


Let a code be represented by a graph in which there are nodes of two types, 
parity-check constraints and equality constraints, joined by edges which rep- 
resent the bits of the code (not all of which need be transmitted). 

The dual code’s graph is obtained by replacing all parity-check nodes by 
equality nodes and vice versa. This type of graph is called a normal graph by 
Forney (2001). 


Further reading 


Duals are important in coding theory because functions involving a code (such 
as the posterior distribution over codewords) can be transformed by a Fourier 
transform into functions over the dual code. For an accessible introduction 
to Fourier analysis on finite groups, see Terras (1999). See also MacWilliams 
and Sloane (1977). 


> 13.11 Generalizing perfectness to other channels 


Having given up on the search for perfect codes for the binary symmetric
channel, we could console ourselves by changing channel. We could call a
code ‘a perfect u-error-correcting code for the binary erasure channel’ if it
can restore any u erased bits, and never more than u. Rather than using the
word perfect, however, the conventional term for such a code is a ‘maximum
distance separable code’, or MDS code.

[In a perfect u-error-correcting code for the binary erasure channel, the
number of redundant bits must be N − K = u.]

As we already noted in exercise 11.10 (p.190), the (7,4) Hamming code is
not an MDS code. It can recover some sets of 3 erased bits, but not all. If
any 3 bits corresponding to a codeword of weight 3 are erased, then one bit of
information is unrecoverable. This is why the (7,4) code is a poor choice for
a RAID system.

A tiny example of a maximum distance separable code is the simple parity-check
code $P_3$ whose parity-check matrix is H = [1 1 1]. This code has 4
codewords, all of which have even parity. All codewords are separated by 
a distance of 2. Any single erased bit can be restored by setting it to the 
parity of the other two bits. The repetition codes are also maximum distance 
separable codes. 
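The erasure claims in this section can be verified by brute force. The sketch below is illustrative only: it assumes one particular systematic ordering of the (7,4) Hamming code (any equivalent ordering gives the same counts), and the function names are hypothetical. An erasure pattern is counted as unrecoverable if two codewords agree on every surviving bit.

```python
from itertools import combinations, product


def codewords(G):
    """All codewords generated by the rows of the binary matrix G."""
    K, N = len(G), len(G[0])
    return [tuple(sum(s[k] * G[k][n] for k in range(K)) % 2 for n in range(N))
            for s in product((0, 1), repeat=K)]


def unrecoverable_patterns(words, n_erased):
    """Erasure patterns for which two codewords agree on all surviving bits."""
    N = len(words[0])
    bad = []
    for erased in combinations(range(N), n_erased):
        kept = [n for n in range(N) if n not in erased]
        survivors = {tuple(w[n] for n in kept) for w in words}
        if len(survivors) < len(words):
            bad.append(erased)
    return bad


# Assumed systematic form of the (7,4) Hamming code.
G_hamming = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
]
# Simple parity code with H = [1 1 1].
G_parity = [
    [1, 0, 1],
    [0, 1, 1],
]

hamming_bad = unrecoverable_patterns(codewords(G_hamming), 3)
parity_bad = unrecoverable_patterns(codewords(G_parity), 1)
print(len(hamming_bad), "of 35 three-erasure patterns fail for the Hamming code")
print(len(parity_bad), "single-erasure patterns fail for the parity code")
```

The failing patterns for the Hamming code are exactly the supports of its weight-3 codewords, which is why it falls short of being MDS even though it has M = 3 parity bits.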


> Exercise 13.13.[p.224] Can you make an (N, K) code, with M = N − K
parity symbols, for a q-ary erasure channel, such that the decoder can
recover the codeword when any M symbols are erased in a block of N?
[Example: for the channel with q = 4 symbols there is an (N, K) = (5, 2)
code which can correct any M = 3 erasures.]


For the q-ary erasure channel with q > 2, there are large numbers of MDS 
codes, of which the Reed-Solomon codes are the most famous and most widely 
used. As long as the field size q is bigger than the blocklength N, MDS block 
codes of any rate can be found. (For further reading, see Lin and Costello 
(1983).) 


> 13.12 Summary 


Shannon’s codes for the binary symmetric channel can almost always correct
fN errors, but they are not fN-error-correcting codes.


Reasons why the distance of a code has little relevance 


1. The Shannon limit shows that the best codes must be able to cope with 
a noise level twice as big as the maximum noise level for a bounded- 
distance decoder. 


2. When the binary symmetric channel has f > 1/4, no code with a 
bounded-distance decoder can communicate at all; but Shannon says 
good codes exist for such channels. 


3. Concatenation shows that we can get good performance even if the dis- 
tance is bad. 


The whole weight enumerator function is relevant to the question of 
whether a code is a good code. 

The relationship between good codes and distance properties is discussed 
further in exercise 13.14 (p.220). 


> 13.13 Further exercises 


Exercise 13.14.[p.224] A codeword t is selected from a linear (N, K) code
C, and it is transmitted over a noisy channel; the received signal is y. 
We assume that the channel is a memoryless channel such as a Gaus- 
sian channel. Given an assumed channel model P(y |t), there are two 
decoding problems. 


The codeword decoding problem is the task of inferring which 
codeword t was transmitted given the received signal. 


The bitwise decoding problem is the task of inferring for each
transmitted bit $t_n$ how likely it is that that bit was a one rather
than a zero.


Consider optimal decoders for these two decoding problems. Prove that 
the probability of error of the optimal bitwise-decoder is closely related 
to the probability of error of the optimal codeword-decoder, by proving 
the following theorem. 


Theorem 13.1 If a binary linear code has minimum distance $d_{\min}$,
then, for any given channel, the codeword bit error probability of the
optimal bitwise decoder, $p_b$, and the block error probability of the
maximum likelihood decoder, $p_B$, are related by:

$$ p_B \ge p_b \ge \frac{1}{2} \frac{d_{\min}}{N} \, p_B. \qquad (13.39) $$


Exercise 13.15. What are the minimum distances of the (15,11) Hamming
code and the (31,26) Hamming code?


> Exercise 13.16. Let A(w) be the average weight enumerator function of a
rate-1/3 random linear code with N = 540 and M = 360. Estimate, 
from first principles, the value of A(w) at w = 1. 


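Exercise 13.17. A code with minimum distance greater than $d_{\rm GV}$. A rather
nice (15,5) code is generated by this generator matrix, which is based
on measuring the parities of all the $\binom{5}{3} = 10$ triplets of source bits:

$$ G = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 1 & 1 & 0 & 1 & 0
\end{bmatrix}. \qquad (13.40) $$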


Find the minimum distance and weight enumerator function of this code. 


Exercise 13.18. Find the minimum distance of the ‘pentagonful’ low-
density parity-check code whose parity-check matrix is 





Show that nine of the ten rows are independent, so the code has param- 
eters N = 15, K = 6. Using a computer, find its weight enumerator 
function. 


> Exercise 13.19. Replicate the calculations used to produce figure 13.12.
Check the assertion that the highest noise level that’s correctable is 
0.0588. Explore alternative concatenated sequences of codes. Can you 
find a better sequence of concatenated codes — better in the sense that it 
has either higher asymptotic rate R or can tolerate a higher noise level f?







Figure 13.17. The graph of the 
pentagonful low-density 
parity-check code with 15 bit 
nodes (circles) and 10 parity-check 
nodes (triangles). [This graph is 
known as the Petersen graph.] 
Exercise 13.20.[p.226] Investigate the possibility of achieving the Shannon
limit with linear block codes, using the following counting argument.
Assume a linear code of large blocklength N and rate R = K/N. The 
code’s parity-check matrix H has M = N — K rows. Assume that the 
code’s optimal decoder, which solves the syndrome decoding problem 
Hn = z, allows reliable communication over a binary symmetric channel 

with flip probability f. 


How many ‘typical’ noise vectors n are there? 
Roughly how many distinct syndromes z are there? 


Since n is reliably deduced from z by the optimal decoder, the number 
of syndromes must be greater than or equal to the number of typical 
noise vectors. What does this tell you about the largest possible value 
of rate R for a given f? 


> Exercise 13.21.[?] Linear binary codes use the input symbols 0 and 1 with 
equal probability, implicitly treating the channel as a symmetric chan- 
nel. Investigate how much loss in communication rate is caused by this 
assumption, if in fact the channel is a highly asymmetric channel. Take 
as an example a Z-channel. How much smaller is the maximum possible 
rate of communication using symmetric inputs than the capacity of the 
channel? [Answer: about 6%.] 


Exercise 13.22. Show that codes with ‘very bad’ distance are ‘bad’ codes, as
defined in section 11.4 (p.183). 


Exercise 13.23. One linear code can be obtained from another by punctur-
ing. Puncturing means taking each codeword and deleting a defined set 
of bits. Puncturing turns an (N, K) code into an (N’, K) code, where 
N'<N. 


Another way to make new linear codes from old is shortening. Shortening 
means constraining a defined set of bits to be zero, and then deleting 
them from the codewords. Typically if we shorten by one bit, half of the 
code’s codewords are lost. Shortening typically turns an (N, K) code 
into an (N′, K′) code, where N − N′ = K − K′.


Another way to make a new linear code from two old ones is to make 
the intersection of the two codes: a codeword is only retained in the new 
code if it is present in both of the two old codes. 


Discuss the effect on a code’s distance-properties of puncturing, short- 
ening, and intersection. Is it possible to turn a code family with bad 
distance into a code family with good distance, or vice versa, by each of 
these three manipulations? 


Exercise 13.24.[p.226] Todd Ebert's ‘hat puzzle’.


Three players enter a room and a red or blue hat is placed on each 
person’s head. The colour of each hat is determined by a coin toss, with 
the outcome of one coin toss having no effect on the others. Each person 
can see the other players’ hats but not his own. 


No communication of any sort is allowed, except for an initial strategy 
session before the group enters the room. Once they have had a chance 
to look at the other hats, the players must simultaneously guess their
own hat’s colour or pass. The group shares a $3 million prize if at least
one player guesses correctly and no players guess incorrectly. 


The same game can be played with any number of players. The general 
problem is to find a strategy for the group that maximizes its chances of 
winning the prize. Find the best strategies for groups of size three and 
seven. 


[Hint: when you've done three and seven, you might be able to solve
fifteen.]


Exercise 13.25. Estimate how many binary low-density parity-check codes
have self-orthogonal duals. [Note that we don’t expect a huge number, 
since almost all low-density parity-check codes are ‘good’, but a low- 
density parity-check code that contains its dual must be ‘bad’.] 


Exercise 13.26. In figure 13.15 we plotted the error probability associated
with a single codeword of weight d as a function of the noise level f of a
binary symmetric channel. Make an equivalent plot for the case of the
Gaussian channel, showing the error probability associated with a single
codeword of weight d as a function of the rate-compensated signal-to-
noise ratio $E_b/N_0$. Because $E_b/N_0$ depends on the rate, you have to
choose a code rate. Choose R = 1/2, 2/3, 3/4, or 5/6.


> 13.14 Solutions 


Solution to exercise 13.4 (p.210). The probability of block error to leading
order is $p_B = \frac{3}{N} \binom{N}{2} f^2$.


Solution to exercise 13.7 (p.217). A binary vector is perpendicular to itself if 
it has even weight, i.e., an even number of 1s. 


Solution to exercise 13.11 (p.219). The self-dual code has two equivalent
parity-check matrices, $H_1 = G = [I_K | P^T]$ and $H_2 = [P | I_K]$; these must be
equivalent to each other through row additions, that is, there is a matrix U
such that $U H_2 = H_1$, so

$$ [\, UP \,|\, U I_K \,] = [\, I_K \,|\, P^T \,]. \qquad (13.42) $$

From the right-hand sides of this equation, we have $U = P^T$, so the left-hand
sides become:

$$ P^T P = I_K. \qquad (13.43) $$

Thus if a code with generator matrix $G = [I_K | P^T]$ is self-dual then P is an
orthogonal matrix, modulo 2, and vice versa.


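Solution to exercise 13.12 (p.219). The (8,4) and (7,4) codes are intimately
related. The (8,4) code, whose parity-check matrix is

$$ H = [\, P \,|\, I_4 \,] = \left[ \begin{array}{cccc|cccc} 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 1 \end{array} \right], \qquad (13.44) $$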


is obtained by (a) appending an extra parity-check bit which can be thought 
of as the parity of all seven bits of the (7,4) Hamming code; and (b) reordering 
the first four bits. 
If you already know the hat
puzzle, you could try the ‘Scottish 
version’ of the rules in which the 
prize is only awarded to the group 
if they all guess correctly. 

In the ‘Reformed Scottish 
version’, all the players must 
guess correctly, and there are two 
rounds of guessing. Those players 
who guess during round one leave 
the room. The remaining players 
must guess in round two. What 
strategy should the team adopt to 
maximize their chance of winning? 
Solution to exercise 13.13 (p.220). If an (N, K) code, with M = N − K parity
symbols, has the property that the decoder can recover the codeword when any 
M symbols are erased in a block of N, then the code is said to be maximum 
distance separable (MDS). 

No MDS binary codes exist, apart from the repetition codes and simple 
parity codes. For q > 2, some MDS codes can be found. 

As a simple example, here is a (9,2) code for the 8-ary erasure channel. 
The code is defined in terms of the multiplication and addition rules of GF(8), 
which are given in Appendix C.1. The elements of the input alphabet are 
{0,1, A,B,C, D, E, F} and the generator matrix of the code is 


$$ G = \begin{bmatrix} 1 & 0 & 1 & A & B & C & D & E & F \\ 0 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix}. \qquad (13.45) $$


The resulting 64 codewords are: 


000000000 011111111 0AAAAAAAA 0BBBBBBBB 0CCCCCCCC 0DDDDDDDD 0EEEEEEEE 0FFFFFFFF
101ABCDEF 110BADCFE 1AB01EFCD 1BA10FEDC 1CDEF01AB 1DCFE10BA 1EFCDAB01 1FEDCBA10
A0ACEB1FD A1BDFA0EC AA0EC1BDF AB1FD0ACE ACE0AFDB1 ADF1BECA0 AECA0DF1B AFDB1CE0A
B0BEDFC1A B1AFCED0B BA1CFDEB0 BB0DECFA1 BCFA1B0DE BDEB0A1CF BED0B1AFC BFC1A0BED
C0CBFEAD1 C1DAEFBC0 CAE1DC0FB CBF0CD1EA CC0FBAE1D CD1EABF0C CEAD10CBF CFBC01DAE
D0D1CAFBE D1C0DBEAF DAFBE0D1C DBEAF1C0D DC1D0EBFA DD0C1FAEB DEBFAC1D0 DFAEBD0C1
E0EF1DBAC E1FE0CABD EACDBF10E EBDCAE01F ECABD1FE0 EDBAC0EF1 EE01FBDCA EF10EACDB
F0FDA1ECB F1ECB0FDA FADF0BCE1 FBCE1ADF0 FCB1EDA0F FDA0FCB1E FE1BCF0AD FF0ADE1BC
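The codeword list can be regenerated mechanically. The sketch below is not taken from Appendix C.1; it assumes that the symbols 0, 1, A, ..., F stand for the 3-bit integers 0 to 7, with A playing the role of x and arithmetic carried out modulo $x^3 + x + 1$, a representation that agrees with the codewords listed above. It also confirms that the minimum distance is 8 = M + 1, which is what allows any M = 7 erasures to be corrected.

```python
from itertools import combinations

# Assumed labelling: symbol index = 3-bit polynomial over GF(2).
SYMBOLS = "01ABCDEF"


def gf8_mul(a, b):
    """Multiply two GF(8) elements modulo x^3 + x + 1 (an assumed convention)."""
    result = 0
    for shift in range(3):
        if (b >> shift) & 1:
            result ^= a << shift
    for degree in (4, 3):                 # reduce the x^4 and x^3 terms
        if (result >> degree) & 1:
            result ^= 0b1011 << (degree - 3)
    return result


ROW1 = [1, 0, 1, 2, 3, 4, 5, 6, 7]        # 1 0 1 A B C D E F
ROW2 = [0, 1, 1, 1, 1, 1, 1, 1, 1]


def encode(a, b):
    """Codeword a*ROW1 + b*ROW2; addition in GF(8) is bitwise XOR."""
    return [gf8_mul(a, g1) ^ gf8_mul(b, g2) for g1, g2 in zip(ROW1, ROW2)]


codewords = [encode(a, b) for a in range(8) for b in range(8)]
as_text = ["".join(SYMBOLS[s] for s in word) for word in codewords]

min_distance = min(sum(u != v for u, v in zip(w1, w2))
                   for w1, w2 in combinations(codewords, 2))
print(as_text[8:16])
print("minimum distance:", min_distance)
```

Two codewords that differ in at least 8 of the 9 positions cannot agree on any 2 surviving symbols, so the remaining symbols always identify the codeword uniquely.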


Solution to exercise 13.14 (p.220). Quick, rough proof of the theorem. Let x 
denote the difference between the reconstructed codeword and the transmitted 
codeword. For any given channel output r, there is a posterior distribution 
over x. This posterior distribution is positive only on vectors x belonging 
to the code; the sums that follow are over codewords x. The block error 
probability is: 

$$ p_B = \sum_{x \neq 0} P(x \mid r). \qquad (13.46) $$

The average bit error probability, averaging over all bits in the codeword, is:

$$ p_b = \sum_{x \neq 0} P(x \mid r) \, \frac{w(x)}{N}, \qquad (13.47) $$

where w(x) is the weight of codeword x. Now the weights of the non-zero
codewords satisfy

$$ 1 \ge \frac{w(x)}{N} \ge \frac{d_{\min}}{N}. \qquad (13.48) $$

Substituting the inequalities (13.48) into the definitions (13.46, 13.47), we
obtain:

$$ p_B \ge p_b \ge \frac{d_{\min}}{N} \, p_B, \qquad (13.49) $$


which is a factor of two stronger, on the right, than the stated result (13.39). 
In making the proof watertight, I have weakened the result a little. 


Careful proof. The theorem relates the performance of the optimal block de- 
coding algorithm and the optimal bitwise decoding algorithm. 

We introduce another pair of decoding algorithms, called the block- 
guessing decoder and the bit-guessing decoder. The idea is that these two 
algorithms are similar to the optimal block decoder and the optimal bitwise 
decoder, but lend themselves more easily to analysis. 

We now define these decoders. Let x denote the inferred codeword. For 
any given code: 
The optimal block decoder returns the codeword x that maximizes the
posterior probability P(x | r), which is proportional to the likelihood
P(r | x).

The probability of error of this decoder is called $p_B$.

The optimal bit decoder returns for each of the N bits, $x_n$, the
value of a that maximizes the posterior probability
$P(x_n = a \mid r) = \sum_{x} P(x \mid r) \, 1[x_n = a]$.

The probability of error of this decoder is called $p_b$.

The block-guessing decoder returns a random codeword x with probabil-
ity distribution given by the posterior probability P(x | r).

The probability of error of this decoder is called $p_B^{G}$.

The bit-guessing decoder returns for each of the N bits, $x_n$, a random bit
from the probability distribution $P(x_n = a \mid r)$.

The probability of error of this decoder is called $p_b^{G}$.


The theorem states that the optimal bit error probability $p_b$ is bounded above
by $p_B$ and below by a given multiple of $p_B$ (13.39).

The left-hand inequality in (13.39) is trivially true — if a block is correct, all 
its constituent bits are correct; so if the optimal block decoder outperformed 
the optimal bit decoder, we could make a better bit decoder from the block 
decoder. 

We prove the right-hand inequality by establishing that: 


(a) the bit-guessing decoder is nearly as good as the optimal bit decoder:

$$ p_b^{G} \le 2 p_b. \qquad (13.50) $$

(b) the bit-guessing decoder’s error probability is related to the block-
guessing decoder’s by

$$ p_b^{G} \ge \frac{d_{\min}}{N} \, p_B^{G}. \qquad (13.51) $$

Then since $p_B^{G} \ge p_B$, we have

$$ p_b \ge \frac{1}{2} p_b^{G} \ge \frac{1}{2} \frac{d_{\min}}{N} \, p_B^{G} \ge \frac{1}{2} \frac{d_{\min}}{N} \, p_B. \qquad (13.52) $$


We now prove the two lemmas. 

Near-optimality of guessing: Consider first the case of a single bit, with posterior
probability $\{p_0, p_1\}$. The optimal bit decoder has probability of error

$$ P^{\rm optimal} = \min(p_0, p_1). \qquad (13.53) $$

The guessing decoder picks from 0 and 1. The truth is also distributed with
the same probability. The probability that the guesser and the truth match is
$p_0^2 + p_1^2$; the probability that they mismatch is the guessing error probability,

$$ P^{\rm guess} = 2 p_0 p_1 \le 2 \min(p_0, p_1) = 2 P^{\rm optimal}. \qquad (13.54) $$

Since $p_b^{G}$ is the average of many such error probabilities, $P^{\rm guess}$, and $p_b$ is the
average of the corresponding optimal error probabilities, $P^{\rm optimal}$, we obtain
the desired relationship (13.50) between $p_b^{G}$ and $p_b$. □
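The single-bit inequality can be checked numerically. This is only a sanity check on a grid of posterior probabilities, with illustrative function names; it is not part of the proof.

```python
def optimal_error(p0):
    """Error probability of the decoder that picks the more probable value."""
    return min(p0, 1 - p0)


def guessing_error(p0):
    """Error probability of the decoder that samples its guess from the posterior."""
    p1 = 1 - p0
    return 2 * p0 * p1          # equals 1 - (p0**2 + p1**2)


grid = [k / 100 for k in range(1, 100)]
ratios = [guessing_error(p) / optimal_error(p) for p in grid]
print("largest ratio guess/optimal on the grid:", max(ratios))
```

The ratio equals 2 max(p0, p1), so it approaches 2 only when the posterior is nearly certain, which is when both error probabilities are small anyway.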


Relationship between bit error probability and block error probability: The bit-
guessing and block-guessing decoders can be combined in a single system: we
can draw a sample $x_n$ from the marginal distribution $P(x_n \mid r)$ by drawing
a sample $(x_n, x)$ from the joint distribution $P(x_n, x \mid r)$, then discarding the
value of x.

We can distinguish between two cases: the discarded value of x is the
correct codeword, or not. The probability of bit error for the bit-guessing
decoder can then be written as a sum of two terms:

$$ p_b^{G} = P(x \text{ correct}) \, P(\text{bit error} \mid x \text{ correct}) + P(x \text{ incorrect}) \, P(\text{bit error} \mid x \text{ incorrect}) $$

$$ \phantom{p_b^{G}} = 0 + p_B^{G} \, P(\text{bit error} \mid x \text{ incorrect}). $$

Now, whenever the guessed x is incorrect, the true x must differ from it in at
least d bits, so the probability of bit error in these cases is at least d/N. So

$$ p_b^{G} \ge \frac{d}{N} \, p_B^{G}. $$

QED. □


Solution to exercise 13.20 (p.222). The number of ‘typical’ noise vectors n
is roughly $2^{N H_2(f)}$. The number of distinct syndromes z is $2^M$. So reliable
communication implies

$$ M \ge N H_2(f), \qquad (13.55) $$

or, in terms of the rate R = 1 − M/N,

$$ R \le 1 - H_2(f), \qquad (13.56) $$


a bound which agrees precisely with the capacity of the channel. 
This argument is turned into a proof in the following chapter. 
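To put numbers on the counting argument, the following sketch evaluates the binary entropy function and the resulting bounds. The function names and the example values of N and f are illustrative assumptions.

```python
from math import log2


def H2(f):
    """Binary entropy function, in bits."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * log2(f) - (1 - f) * log2(1 - f)


def max_rate(f):
    """Largest rate allowed by the counting argument, R <= 1 - H2(f)."""
    return 1 - H2(f)


def min_syndrome_bits(N, f):
    """Smallest M such that 2^M is at least the number of typical noise vectors."""
    return N * H2(f)


for f in (0.01, 0.1, 0.25):
    print(f, round(max_rate(f), 3), round(min_syndrome_bits(1000, f), 1))
```

For example, at f = 0.1 a block of N = 1000 bits needs roughly 469 syndrome bits, leaving a rate of about 0.53.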


Solution to exercise 13.24 (p.222). In the three-player case, it is possible for 
the group to win three-quarters of the time. 

Three-quarters of the time, two of the players will have hats of the same 
colour and the third player’s hat will be the opposite colour. The group can 
win every time this happens by using the following strategy. Each player looks 
at the other two players’ hats. If the two hats are different colours, he passes. 
If they are the same colour, the player guesses his own hat is the opposite 
colour. 

This way, every time the hat colours are distributed two and one, one 
player will guess correctly and the others will pass, and the group will win the 
game. When all the hats are the same colour, however, all three players will 
guess incorrectly and the group will lose. 

When any particular player guesses a colour, it is true that there is only a 
50:50 chance that their guess is right. The reason that the group wins 75% of 
the time is that their strategy ensures that when players are guessing wrong, 
a great many are guessing wrong. 

For larger numbers of players, the aim is to ensure that most of the time 
no one is wrong and occasionally everyone is wrong at once. In the game with 
7 players, there is a strategy for which the group wins 7 out of every 8 times 
they play. In the game with 15 players, the group can win 15 out of 16 times. 
If you have not figured out these winning strategies for teams of 7 and 15, 
I recommend thinking about the solution to the three-player game in terms 
of the locations of the winning and losing states on the three-dimensional
hypercube, then thinking laterally.

If the number of players, N, is $2^r - 1$, the optimal strategy can be defined
using a Hamming code of length N, and the probability of winning the prize
is N/(N + 1). Each player is identified with a number n ∈ 1…N. The two
colours are mapped onto 0 and 1. Any state of their hats can be viewed as a 
received vector out of a binary channel. A random binary vector of length N 
is either a codeword of the Hamming code, with probability 1/(N + 1), or it 
differs in exactly one bit from a codeword. Each player looks at all the other 
bits and considers whether his bit can be set to a colour such that the state is 
a codeword (which can be deduced using the decoder of the Hamming code). 
If it can, then the player guesses that his hat is the other colour. If the state is 
actually a codeword, all players will guess and will guess wrong. If the state is 
a non-codeword, only one player will guess, and his guess will be correct. It’s 
quite easy to train seven players to follow the optimal strategy if the cyclic 
representation of the (7,4) Hamming code is used (p.19). 
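The strategy just described can be checked exhaustively for small N. The sketch below assumes the Hamming code whose parity-check matrix has the binary representation of n as its nth column, so that the syndrome of a hat configuration is the XOR of the numbers of the players wearing colour 1; this is one convenient choice rather than the cyclic representation mentioned above, and the function names are illustrative.

```python
from fractions import Fraction
from itertools import product


def syndrome(bits):
    """XOR of the (1-based) positions holding a 1."""
    s = 0
    for n, b in enumerate(bits, start=1):
        if b:
            s ^= n
    return s


def group_wins(bits):
    """Play one round; True if at least one guess is made and none is wrong."""
    s = syndrome(bits)
    outcomes = []
    for n, own in enumerate(bits, start=1):
        s_others = s ^ (n if own else 0)   # syndrome with player n's bit set to 0
        if s_others == 0:
            guess = 1                      # own = 0 would give a codeword: guess 1
        elif s_others == n:
            guess = 0                      # own = 1 would give a codeword: guess 0
        else:
            continue                       # no codeword is reachable: pass
        outcomes.append(guess == own)
    return bool(outcomes) and all(outcomes)


def win_probability(N):
    """Exact winning probability over all 2^N equally likely hat states."""
    wins = sum(group_wins(bits) for bits in product((0, 1), repeat=N))
    return Fraction(wins, 2 ** N)


for N in (3, 7):
    print(N, win_probability(N))
```

The only losing states are the codewords themselves, where every player guesses and every guess is wrong.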
About Chapter 14


In this chapter we will draw together several ideas that we’ve encountered 
so far in one nice short proof. We will simultaneously prove both Shannon’s 
noisy-channel coding theorem (for symmetric binary channels) and his source 
coding theorem (for binary sources). While this proof has connections to many 
preceding chapters in the book, it’s not essential to have read them all. 

On the noisy-channel coding side, our proof will be more constructive than 
the proof given in Chapter 10; there, we proved that almost any random code 
is ‘very good’. Here we will show that almost any linear code is very good. We 
will make use of the idea of typical sets (Chapters 4 and 10), and we’ll borrow 
from the previous chapter’s calculation of the weight enumerator function of 
random linear codes (section 13.5). 

On the source coding side, our proof will show that random linear hash 
functions can be used for compression of compressible binary sources, thus 
giving a link to Chapter 12. 
Very Good Linear Codes Exist


In this chapter we’ll use a single calculation to prove simultaneously the source 
coding theorem and the noisy-channel coding theorem for the binary symmet- 
ric channel. 

Incidentally, this proof works for much more general channel models, not 
only the binary symmetric channel. For example, the proof can be reworked 
for channels with non-binary outputs, for time-varying channels and for chan- 
nels with memory, as long as they have binary inputs satisfying a symmetry 
property, cf. section 10.6. 


> 14.1 A simultaneous proof of the source coding and noisy-channel 
coding theorems 


We consider a linear error-correcting code with binary parity-check matrix H. 
The matrix has M rows and N columns. Later in the proof we will increase 
N and M, keeping M ∝ N. The rate of the code satisfies

$$ R \ge 1 - \frac{M}{N}. \qquad (14.1) $$


If all the rows of H are independent then this is an equality, R = 1 — M/N. 
In what follows, we’ll assume the equality holds. Eager readers may work out 
the expected rank of a random binary matrix H (it’s very close to M) and 
pursue the effect that the difference (M — rank) has on the rest of this proof 
(it’s negligible). 

A codeword t is selected, satisfying 


$$ Ht = 0 \bmod 2, \qquad (14.2) $$

and a binary symmetric channel adds noise x, giving the received signal

$$ r = t + x \bmod 2. \qquad (14.3) $$

[In this chapter x denotes the noise added by the channel, not
the input to the channel.]


The receiver aims to infer both t and x from r using a syndrome-decoding 
approach. Syndrome decoding was first introduced in section 1.2 (p.10 and 
11). The receiver computes the syndrome 


$$ z = Hr \bmod 2 = Ht + Hx \bmod 2 = Hx \bmod 2. \qquad (14.4) $$


The syndrome only depends on the noise x, and the decoding problem is to 
find the most probable x that satisfies 


$$ Hx = z \bmod 2. \qquad (14.5) $$
This best estimate for the noise vector, $\hat{x}$, is then subtracted from r to give the
best guess for t. Our aim is to show that, as long as $R < 1 - H(X) = 1 - H_2(f)$,
where f is the flip probability of the binary symmetric channel, the optimal
decoder for this syndrome-decoding problem has vanishing probability of error,
as N increases, for random H.
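The fact that the syndrome depends only on the noise can be seen in a small brute-force sketch. It uses a parity-check matrix of the (7,4) Hamming code purely as a convenient small H (the proof itself concerns large random H), and the helper names are illustrative. For f < 1/2 the most probable noise vector consistent with a syndrome is the one of lowest weight.

```python
from itertools import product

# Assumed small example: a parity-check matrix of the (7,4) Hamming code.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]
N = len(H[0])


def syndrome(v):
    """z = H v mod 2."""
    return tuple(sum(h * b for h, b in zip(row, v)) % 2 for row in H)


def add_mod2(u, v):
    return tuple((a + b) % 2 for a, b in zip(u, v))


all_vectors = list(product((0, 1), repeat=N))
codewords = [t for t in all_vectors if syndrome(t) == (0, 0, 0)]


def decode_noise(z):
    """Lowest-weight x with H x = z, i.e. the most probable noise when f < 1/2."""
    return min((x for x in all_vectors if syndrome(x) == z), key=sum)


t = codewords[5]
x = (0, 0, 0, 1, 0, 0, 0)
r = add_mod2(t, x)
x_hat = decode_noise(syndrome(r))
print("recovered noise:", x_hat, "recovered codeword:", add_mod2(r, x_hat))
```

Exhaustive search over all noise vectors is feasible only for tiny N, which is one reason the analysis in this chapter works with the typical-set decoder rather than an implementable algorithm.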
We prove this result by studying a sub-optimal strategy for solving the
decoding problem. Neither the optimal decoder nor this typical-set decoder
would be easy to implement, but the typical-set decoder is easier to analyze.
The typical-set decoder examines the typical set T of noise vectors, the set
of noise vectors x′ that satisfy log 1/P(x′) ≃ N H(X), checking to see if any of
those typical vectors x′ satisfies the observed syndrome,

$$ Hx' = z. \qquad (14.6) $$

[We’ll leave out the εs and βs that make a typical-set definition
rigorous. Enthusiasts are encouraged to revisit section 4.4
and put these details into this proof.]

If exactly one typical vector x′ does so, the typical set decoder reports that
vector as the hypothesized noise vector. If no typical vector matches the 
observed syndrome, or more than one does, then the typical set decoder reports 
an error. 
The probability of error of the typical-set decoder, for a given matrix H, 
can be written as a sum of two terms, 
$$ P_{{\rm TS}|H} = P^{({\rm I})} + P^{({\rm II})}_{{\rm TS}|H}, \qquad (14.7) $$
where $P^{({\rm I})}$ is the probability that the true noise vector x is itself not typical,
and $P^{({\rm II})}_{{\rm TS}|H}$ is the probability that the true x is typical and at least one other
typical vector clashes with it. The first probability vanishes as N increases, as 
we proved when we first studied typical sets (Chapter 4). We concentrate on 
the second probability. To recap, we’re imagining a true noise vector, x; and 
if any of the typical noise vectors x’, different from x, satisfies H(x’ — x) = 0, 
then we have an error. We use the truth function 


1[H(x' — x) = 0], (14.8) 


whose value is one if the statement H(x’ — x) = 0 is true and zero otherwise. 
We can bound the number of type II errors made when the noise is x thus:

$$ [\text{Number of errors given } x \text{ and } H] \le \sum_{x' :\, x' \in T,\; x' \neq x} 1[H(x' - x) = 0]. \qquad (14.9) $$

The number of errors is either zero or one; the sum on the right-hand side
may exceed one, in cases where several typical noise vectors have the same
syndrome. [Equation (14.9) is a union bound.]

We can now write down the probability of a type-II error by averaging over
x:

$$ P^{({\rm II})}_{{\rm TS}|H} \le \sum_{x \in T} P(x) \sum_{x' :\, x' \in T,\; x' \neq x} 1[H(x' - x) = 0]. \qquad (14.10) $$

Now, we will find the average of this probability of type-II error over all linear 
codes by averaging over H. By showing that the average probability of type-II 
error vanishes, we will thus show that there exist linear codes with vanishing 
error probability, indeed, that almost all linear codes are very good. 

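We denote averaging over all binary matrices H by $\langle \ldots \rangle_H$. The average
probability of type-II error is

$$ \bar{P}^{({\rm II})}_{\rm TS} = \sum_{H} P(H) \, P^{({\rm II})}_{{\rm TS}|H} = \left\langle P^{({\rm II})}_{{\rm TS}|H} \right\rangle_H \qquad (14.11) $$

$$ \le \left\langle \sum_{x \in T} P(x) \sum_{x' :\, x' \in T,\; x' \neq x} 1[H(x' - x) = 0] \right\rangle_H \qquad (14.12) $$

$$ = \sum_{x \in T} P(x) \sum_{x' :\, x' \in T,\; x' \neq x} \left\langle 1[H(x' - x) = 0] \right\rangle_H. \qquad (14.13) $$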

Now, the quantity $\langle 1[H(x' - x) = 0] \rangle_H$ already cropped up when we were
calculating the expected weight enumerator function of random linear codes
(section 13.5): for any non-zero binary vector v, the probability that Hv = 0,
averaging over all matrices H, is $2^{-M}$. So

$$ \bar{P}^{({\rm II})}_{\rm TS} \le \left( \sum_{x \in T} P(x) \right) (|T| - 1) \, 2^{-M} \qquad (14.14) $$

$$ \le |T| \, 2^{-M}, \qquad (14.15) $$

where |T| denotes the size of the typical set. As you will recall from Chapter
4, there are roughly $2^{N H(X)}$ noise vectors in the typical set. So

$$ \bar{P}^{({\rm II})}_{\rm TS} \le 2^{N H(X)} \, 2^{-M}. \qquad (14.16) $$


This bound on the probability of error either vanishes or grows exponentially 
as N increases (remembering that we are keeping M proportional to N as N 
increases). It vanishes if 

H(X) < M/N. (14.17) 
Substituting R = 1— M/N, we have thus established the noisy-channel coding 
theorem for the binary symmetric channel: very good linear codes exist for 
any rate R satisfying 

R<1-H(X), (14.18) 


where H(X) is the entropy of the channel noise, per bit. 
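The averaging step used in the proof, that a fixed non-zero vector v satisfies Hv = 0 for a fraction $2^{-M}$ of all binary matrices H, can be confirmed by enumeration for small sizes. The sketch below is illustrative; the function name and the chosen sizes are assumptions.

```python
from fractions import Fraction
from itertools import product


def fraction_with_Hv_zero(M, N, v):
    """Fraction of all M x N binary matrices H for which H v = 0 mod 2."""
    hits = 0
    total = 0
    for entries in product((0, 1), repeat=M * N):
        rows = [entries[m * N:(m + 1) * N] for m in range(M)]
        total += 1
        if all(sum(h * b for h, b in zip(row, v)) % 2 == 0 for row in rows):
            hits += 1
    return Fraction(hits, total)


print(fraction_with_Hv_zero(2, 4, (1, 0, 1, 1)))   # a non-zero v
print(fraction_with_Hv_zero(3, 3, (0, 1, 0)))      # another non-zero v
print(fraction_with_Hv_zero(2, 4, (0, 0, 0, 0)))   # the zero vector always satisfies Hv = 0
```

Each row of H independently has even overlap with a non-zero v with probability 1/2, which is where the factor $2^{-M}$ comes from.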














Exercise 14.1. Redo the proof for a more general channel.


> 14.2 Data compression by linear hash codes 


The decoding game we have just played can also be viewed as an uncompres- 
sion game. The world produces a binary noise vector x from a source P(x). 
The noise has redundancy (if the flip probability is not 0.5). We compress it 
with a linear compressor that maps the N-bit input x (the noise) to the M-bit 
output z (the syndrome). Our uncompression task is to recover the input x 
from the output z. The rate of the compressor is 


$$ R_{\rm compressor} = M/N. \qquad (14.19) $$


[We don’t care about the possibility of linear redundancies in our definition 
of the rate, here.] The result that we just found, that the decoding problem 
can be solved, for almost any H, with vanishing error probability, as long as 
H(X) < M/N, thus instantly proves a source coding theorem: 


Given a binary source X of entropy H(X), and a required com-
pressed rate R > H(X), there exists a linear compressor x → z =
Hx mod 2 having rate M/N equal to that required rate R, and an
associated uncompressor, that is virtually lossless.


This theorem is true not only for a source of independent identically dis- 
tributed symbols but also for any source for which a typical set can be de- 
fined: sources with memory, and time-varying sources, for example; all that’s 
required is that the source be ergodic. 
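A toy version of the linear compressor makes the counting constraint visible. The sketch assumes a 3 × 7 matrix H (a parity-check matrix of the (7,4) Hamming code, chosen only because it is small) and treats the vectors of weight at most 1 as the set to be compressed; the names are illustrative.

```python
from itertools import product

# Assumed toy compressor: N = 7 input bits, M = 3 output bits.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def compress(x):
    """Linear hash z = H x mod 2."""
    return tuple(sum(h * b for h, b in zip(row, x)) % 2 for row in H)


def clashes(max_weight):
    """Number of inputs of weight <= max_weight whose hash was already taken."""
    seen = set()
    count = 0
    for x in product((0, 1), repeat=len(H[0])):
        if sum(x) <= max_weight:
            z = compress(x)
            if z in seen:
                count += 1
            else:
                seen.add(z)
    return count


print("clashes among weight <= 1 inputs:", clashes(1))
print("clashes among weight <= 2 inputs:", clashes(2))
```

With 8 candidate inputs and 2^3 = 8 possible outputs, the compressor is lossless on the sparse set; enlarging the set to 29 inputs forces collisions, mirroring the requirement that the typical set fit within $2^M$ syndromes.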
Notes


This method for proving that codes are good can be applied to other linear 
codes, such as low-density parity-check codes (MacKay, 1999b; Aji et al., 2000). 
For each code we need an approximation of its expected weight enumerator
function.
Further Exercises on Information Theory


The most exciting exercises, which will introduce you to further ideas in in- 
formation theory, are towards the end of this chapter. 


Refresher exercises on source coding and noisy channels 


> Exercise 15.1. Let X be an ensemble with $A_X = \{0, 1\}$ and $P_X =
\{0.995, 0.005\}$. Consider source coding using the block coding of $X^{100}$
where every $x \in X^{100}$ containing 3 or fewer 1s is assigned a distinct
codeword, while the other xs are ignored.


(a) If the assigned codewords are all of the same length, find the min- 
imum length required to provide the above set with distinct code- 
words. 


(b) Calculate the probability of getting an x that will be ignored. 


> Exercise 15.2. Let X be an ensemble with $P_X = \{0.1, 0.2, 0.3, 0.4\}$. The en-
semble is encoded using the symbol code C = {0001, 001, 01, 1}. Consider
the codeword corresponding to $x \in X^N$, where N is large.


(a) Compute the entropy of the fourth bit of transmission. 


(b) Compute the conditional entropy of the fourth bit given the third 
bit. 
(c) Estimate the entropy of the hundredth bit. 


(d) Estimate the conditional entropy of the hundredth bit given the 
ninety-ninth bit. 


Exercise 15.3. Two fair dice are rolled by Alice and the sum is recorded.
Bob’s task is to ask a sequence of questions with yes/no answers to find
out this number. Devise in detail a strategy that achieves the minimum
possible average number of questions.


> Exercise 15.4.[2] How can you use a coin to draw straws among 3 people?


> Exercise 15.5.[?] In a magic trick, there are three participants: the magician, 
an assistant, and a volunteer. The assistant, who claims to have paranor- 
mal abilities, is in a soundproof room. The magician gives the volunteer 
six blank cards, five white and one blue. The volunteer writes a dif- 
ferent integer from 1 to 100 on each card, as the magician is watching. 
The volunteer keeps the blue card. The magician arranges the five white 
cards in some order and passes them to the assistant. The assistant then 
announces the number on the blue card. 


How does the trick work? 

> Exercise 15.6.[3] How does this trick work?


‘Here’s an ordinary pack of cards, shuffled into random order. 
Please choose five cards from the pack, any that you wish. 
Don’t let me see their faces. No, don’t give them to me: pass 
them to my assistant Esmerelda. She can look at them. 
‘Now, Esmerelda, show me four of the cards. Hmm... nine 
of spades, six of clubs, four of hearts, ten of diamonds. The 
hidden card, then, must be the queen of spades!’


The trick can be performed as described above for a pack of 52 cards. 
Use information theory to give an upper bound on the number of cards 
for which the trick can be performed. 


> Exercise 15.7.[2] Find a probability sequence $p = (p_1, p_2, \ldots)$ such that
$H(p) = \infty$.


> Exercise 15.8.[3] Consider a discrete memoryless source with $A_X = \{a, b, c, d\}$
and $P_X = \{1/2, 1/4, 1/8, 1/8\}$. There are $4^8 = 65\,536$ eight-letter words
that can be formed from the four letters. Find the total number of such
words that are in the typical set $T_{N\beta}$ (equation 4.29) where $N = 8$ and
$\beta = 0.1$.

> Exercise 15.9.[2] Consider the source $A_S = \{a, b, c, d, e\}$, $P_S =
\{1/3, 1/3, 1/9, 1/9, 1/9\}$ and the channel whose transition probability matrix
is

\[
Q = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 2/3 & 0 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1/3 & 0
\end{bmatrix} \qquad (15.1)
\]

Note that the source alphabet has five symbols, but the channel alphabet 
Ax = Ay = {0,1,2,3} has only four. Assume that the source produces 
symbols at exactly 3/4 the rate that the channel accepts channel sym- 
bols. For a given (tiny) € > 0, explain how you would design a system 
for communicating the source’s output over the channel with an aver- 
age error probability per source symbol less than e. Be as explicit as 
possible. In particular, do not invoke Shannon’s noisy-channel coding 
theorem. 


> Exercise 15.10.[2] Consider a binary symmetric channel and a code $C =
\{0000, 0011, 1100, 1111\}$; assume that the four codewords are used with
probabilities $\{1/2, 1/8, 1/8, 1/4\}$.
What is the decoding rule that minimizes the probability of decoding
error? [The optimal decoding rule depends on the noise level f of the
binary symmetric channel. Give the decoding rule for each range of
values of f, for f between 0 and 1/2.]


> Exercise 15.11.[2] Find the capacity and optimal input distribution for the
three-input, three-output channel whose transition probabilities are:

\[
Q = \begin{bmatrix}
1 & 0 & 0 \\
0 & 2/3 & 1/3 \\
0 & 1/3 & 2/3
\end{bmatrix} \qquad (15.2)
\]

> Exercise 15.12.[3, p.239] The input to a channel Q is a word of 8 bits. The
output is also a word of 8 bits. Each time it is used, the channel flips
exactly one of the transmitted bits, but the receiver does not know which
one. The other seven bits are received without error. All 8 bits are
equally likely to be the one that is flipped. Derive the capacity of this
channel.

Show, by describing an explicit encoder and decoder, that it is possible
reliably (that is, with zero error probability) to communicate 5 bits per
cycle over this channel.


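> Exercise 15.13.[2] A channel with input $x \in \{a, b, c\}$ and output $y \in \{r, s, t, u\}$
has conditional probability matrix (columns ordered a, b, c; rows ordered
r, s, t, u):

\[
Q = \begin{bmatrix}
1/2 & 0 & 0 \\
1/2 & 1/2 & 0 \\
0 & 1/2 & 1/2 \\
0 & 0 & 1/2
\end{bmatrix}
\]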


What is its capacity? 


> Exercise 15.14.[3] The ten-digit number on the cover of a book known as the
ISBN incorporates an error-detecting code. The number consists of nine
source digits $x_1, x_2, \ldots, x_9$, satisfying $x_n \in \{0, 1, \ldots, 9\}$, and a tenth
check digit whose value is given by

\[
x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \bmod 11.
\]

Here $x_{10} \in \{0, 1, \ldots, 9, 10\}$. If $x_{10} = 10$ then the tenth digit is shown
using the roman numeral X.

Table 15.1. Some valid ISBNs: 0-521-64298-1 and 1-010-00000-4. [The hyphens
are included for legibility.]

Show that a valid ISBN satisfies:

\[
\left( \sum_{n=1}^{10} n x_n \right) \bmod 11 = 0.
\]

Imagine that an ISBN is communicated over an unreliable human chan- 
nel which sometimes modifies digits and sometimes reorders digits. 


Show that this code can be used to detect (but not correct) all errors in
which any one of the ten digits is modified (for example, 1-010-00000-4
→ 1-010-00080-4).


Show that this code can be used to detect all errors in which any two
adjacent digits are transposed (for example, 1-010-00000-4 → 1-100-00000-4).


What other transpositions of pairs of non-adjacent digits can be de- 
tected? 


If the tenth digit were defined to be

\[
x_{10} = \left( \sum_{n=1}^{9} n x_n \right) \bmod 10,
\]


why would the code not work so well? (Discuss the detection of both 
modifications of single digits and transpositions of digits.) 
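
As a way to experiment with the mod-11 check before attempting the proofs, here is a minimal Python sketch. The function names `check_digit`, `parse` and `is_valid` are illustrative choices, not part of the exercise; the ISBNs are the ones quoted above.

```python
def check_digit(source_digits):
    """Tenth digit: (sum of n * x_n for n = 1..9) mod 11; the value 10 is printed as X."""
    return sum(n * x for n, x in enumerate(source_digits, start=1)) % 11


def parse(isbn):
    """Turn a string such as '0-521-64298-1' into a list of ten integers (X stands for 10)."""
    return [10 if ch == "X" else int(ch) for ch in isbn if ch != "-"]


def is_valid(isbn):
    """A ten-digit string is valid if the weighted sum over all ten digits is 0 mod 11."""
    digits = parse(isbn)
    weighted = sum(n * x for n, x in enumerate(digits, start=1))
    return len(digits) == 10 and weighted % 11 == 0


for isbn in ["0-521-64298-1", "1-010-00000-4",   # the two valid examples
             "1-010-00080-4",                    # one digit modified
             "1-100-00000-4"]:                   # two adjacent digits transposed
    print(isbn, is_valid(isbn))
```

Trying other single-digit changes and transpositions with this sketch, and then repeating the experiment with `% 10` in place of `% 11`, gives a feel for the last part of the exercise.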

> Exercise 15.15.[2] A channel with input x and output y has transition
probability matrix (rows and columns ordered a, b, c, d):

\[
Q = \begin{bmatrix}
1-f & f & 0 & 0 \\
f & 1-f & 0 & 0 \\
0 & 0 & 1-g & g \\
0 & 0 & g & 1-g
\end{bmatrix}
\]

Assuming an input distribution of the form

\[
P_X = \left\{ \frac{p}{2}, \frac{p}{2}, \frac{1-p}{2}, \frac{1-p}{2} \right\},
\]

write down the entropy of the output, $H(Y)$, and the conditional entropy
of the output given the input, $H(Y|X)$.
Show that the optimal input distribution is given by

\[
p = \frac{1}{1 + 2^{-H_2(g) + H_2(f)}},
\]

where $H_2(f) = f \log_2 \frac{1}{f} + (1-f) \log_2 \frac{1}{1-f}$. [Remember $\frac{d}{dp} H_2(p) = \log_2 \frac{1-p}{p}$.]


Write down the optimal input distribution and the capacity of the chan- 
nel in the case f = 1/2, g = 0, and comment on your answer. 
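
The stated optimum can be checked numerically. The sketch below assumes the block structure of the matrix above, for which the output distribution equals the input distribution, so that $H(Y) = 1 + H_2(p)$ and $H(Y|X) = p H_2(f) + (1-p) H_2(g)$. The function names are illustrative.

```python
from math import log2


def H2(q):
    """Binary entropy in bits, with H2(0) = H2(1) = 0."""
    return 0.0 if q in (0.0, 1.0) else q * log2(1 / q) + (1 - q) * log2(1 / (1 - q))


def mutual_information(p, f, g):
    """I(X;Y) = H(Y) - H(Y|X) for the input distribution {p/2, p/2, (1-p)/2, (1-p)/2}."""
    return 1 + H2(p) - p * H2(f) - (1 - p) * H2(g)


def optimal_p(f, g):
    """The closed-form optimum quoted in the exercise."""
    return 1 / (1 + 2 ** (H2(f) - H2(g)))


f, g = 0.5, 0.0
p_star = optimal_p(f, g)
capacity = mutual_information(p_star, f, g)
# A brute-force scan over p should not beat the closed-form answer.
grid_best = max(mutual_information(k / 1000, f, g) for k in range(1001))
print(p_star, capacity, grid_best)
```

For $f = 1/2$, $g = 0$ the formula gives $p = 1/3$ and the mutual information equals $\log_2 3$ bits, which is a useful hint for the comment the exercise asks for.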


> Exercise 15.16.[2] What are the differences in the redundancies needed in an
error-detecting code (which can reliably detect that a block of data has 
been corrupted) and an error-correcting code (which can detect and cor- 
rect errors)? 


Further tales from information theory 


The following exercises give you the chance to discover for yourself the answers 
to some more surprising results of information theory. 


Exercise 15.17.[3] Communication of information from correlated sources.
Imagine that we want to communicate data from two data sources $X^{(A)}$ and $X^{(B)}$
to a central location C via noise-free one-way communication channels
(figure 15.2a). The signals $x^{(A)}$ and $x^{(B)}$ are strongly dependent, so their joint
information content is only a little greater than the marginal information
content of either of them. For example, C is a weather collator who wishes to
receive a string of reports saying whether it is raining in Allerton ($x^{(A)}$) and
whether it is raining in Bognor ($x^{(B)}$). The joint probability of $x^{(A)}$ and $x^{(B)}$
might be

    P(x^(A), x^(B)):             x^(A) = 0    x^(A) = 1
                    x^(B) = 0      0.49         0.01
                    x^(B) = 1      0.01         0.49          (15.3)

The weather collator would like to know N successive values of $x^{(A)}$ and $x^{(B)}$
exactly, but, since he has to pay for every bit of information he receives, he
is interested in the possibility of avoiding buying N bits from source A and
N bits from source B. Assuming that variables $x^{(A)}$ and $x^{(B)}$ are generated
repeatedly from this distribution, can they be encoded at rates $R_A$ and $R_B$ in
such a way that C can reconstruct all the variables, with the sum of information
transmission rates on the two lines being less than two bits per cycle?

Figure 15.2. Communication of information from dependent sources. (a) $x^{(A)}$
and $x^{(B)}$ are dependent sources (the dependence is represented by the dotted
arrow). Strings of values of each variable are encoded using codes of rate $R_A$
and $R_B$ into transmissions $t^{(A)}$ and $t^{(B)}$, which are communicated over
noise-free channels to a receiver C. (b) The achievable rate region. Both
strings can be conveyed without error even though $R_A < H(X^{(A)})$ and
$R_B < H(X^{(B)})$.

The answer, which you should demonstrate, is indicated in figure 15.2.
In the general case of two dependent sources $X^{(A)}$ and $X^{(B)}$, there exist
codes for the two transmitters that can achieve reliable communication of
both $X^{(A)}$ and $X^{(B)}$ to C, as long as: the information rate from $X^{(A)}$,
$R_A$, exceeds $H(X^{(A)} | X^{(B)})$; the information rate from $X^{(B)}$, $R_B$, exceeds
$H(X^{(B)} | X^{(A)})$; and the total information rate $R_A + R_B$ exceeds the joint
entropy $H(X^{(A)}, X^{(B)})$ (Slepian and Wolf, 1973).
So in the case of $x^{(A)}$ and $x^{(B)}$ above, each transmitter must transmit at
a rate greater than $H_2(0.02) = 0.14$ bits, and the total rate $R_A + R_B$ must
be greater than 1.14 bits, for example $R_A = 0.6$, $R_B = 0.6$. There exist codes
that can achieve these rates. Your task is to figure out why this is so.
Try to find an explicit solution in which one of the sources is sent as plain
text, $t^{(B)} = x^{(B)}$, and the other is encoded.
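
The two thresholds quoted above (0.14 bits and 1.14 bits) can be reproduced directly from the joint distribution of the weather example. This is only a numerical check of the stated bounds, not a construction of the codes; the variable names are illustrative.

```python
from math import log2

# Joint distribution of the weather example, keyed by (x_A, x_B).
joint = {(0, 0): 0.49, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.49}


def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return sum(p * log2(1 / p) for p in probs if p > 0)


# Marginal distributions of each source.
P_A = [sum(p for (a, _), p in joint.items() if a == v) for v in (0, 1)]
P_B = [sum(p for (_, b), p in joint.items() if b == v) for v in (0, 1)]

H_joint = entropy(joint.values())
H_A, H_B = entropy(P_A), entropy(P_B)
H_A_given_B = H_joint - H_B   # chain rule: H(A,B) = H(B) + H(A|B)
H_B_given_A = H_joint - H_A

print(f"H(A|B) = {H_A_given_B:.4f}, H(B|A) = {H_B_given_A:.4f}, H(A,B) = {H_joint:.4f}")

# The example rates R_A = R_B = 0.6 satisfy all three conditions.
R_A = R_B = 0.6
print(R_A > H_A_given_B, R_B > H_B_given_A, R_A + R_B > H_joint)
```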








Exercise 15.18.[3] Multiple access channels. Consider a channel with two sets
of inputs and one output, for example a shared telephone line (figure 15.3a).
A simple model system has two binary inputs $x^{(A)}$ and $x^{(B)}$ and a ternary
output y equal to the arithmetic sum of the two inputs, that’s 0, 1 or 2. There
is no noise. Users A and B cannot communicate with each other, and they
cannot hear the output of the channel. If the output is a 0, the receiver can
be certain that both inputs were set to 0; and if the output is a 2, the receiver
can be certain that both inputs were set to 1. But if the output is 1, then
it could be that the input state was (0,1) or (1,0). How should users A and
B use this channel so that their messages can be deduced from the received
signals? How fast can A and B communicate?

Clearly the total information rate from A and B to the receiver cannot
be two bits. On the other hand it is easy to achieve a total information rate
$R_A + R_B$ of one bit. Can reliable communication be achieved at rates $(R_A, R_B)$
such that $R_A + R_B > 1$?

The answer is indicated in figure 15.3. 
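
One way to see why two bits is out of reach: since there is no noise, the information the receiver gets per use is the entropy of the ternary output y, which is at most $\log_2 3$ bits. The sketch below (function name illustrative) evaluates the output entropy for independent inputs; it gives an upper bound on $R_A + R_B$ only, and does not by itself show that any rate above one bit is achievable.

```python
from itertools import product
from math import log2


def output_entropy(pa, pb):
    """Entropy in bits of y = x_A + x_B for independent inputs with P(x_A=1)=pa, P(x_B=1)=pb."""
    dist = {0: 0.0, 1: 0.0, 2: 0.0}
    for xa, xb in product((0, 1), repeat=2):
        prob = (pa if xa else 1 - pa) * (pb if xb else 1 - pb)
        dist[xa + xb] += prob
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)


print(output_entropy(0.5, 0.5))  # uniform independent inputs: y has distribution (1/4, 1/2, 1/4)
print(log2(3))                   # largest entropy any ternary output can have
```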

Some practical codes for multi-user channels are presented in Ratzer and 
MacKay (2003). 


Figure 15.3. Multiple access channels. (a) A general multiple access channel
with two transmitters and one receiver. (b) A binary multiple access channel
with output equal to the sum of two inputs. (c) The achievable region.


Exercise 15.19.[3] Broadcast channels. A broadcast channel consists of a single
transmitter and two or more receivers. The properties of the channel are
defined by a conditional distribution $Q(y^{(A)}, y^{(B)} | x)$. (We’ll assume the channel
is memoryless.) The task is to add an encoder and two decoders to enable
reliable communication of a common message at rate $R_0$ to both receivers, an
individual message at rate $R_A$ to receiver A, and an individual message at rate
$R_B$ to receiver B. The capacity region of the broadcast channel is the convex
hull of the set of achievable rate triplets $(R_0, R_A, R_B)$.

Figure 15.4. The broadcast channel. x is the channel input; $y^{(A)}$ and $y^{(B)}$
are the outputs.

A simple benchmark for such a channel is given by time-sharing (time-
division signaling). If the capacities of the two channels, considered separately,
are $C^{(A)}$ and $C^{(B)}$, then by devoting a fraction $\phi_A$ of the transmission time
to channel A and $\phi_B = 1 - \phi_A$ to channel B, we can achieve $(R_0, R_A, R_B) =
(0, \phi_A C^{(A)}, \phi_B C^{(B)})$.

We can do better than this, however. As an analogy, imagine speaking 
simultaneously to an American and a Belarusian; you are fluent in American 
and in Belarusian, but neither of your two receivers understands the other’s 
language. If each receiver can distinguish whether a word is in their own 
language or not, then an extra binary file can be conveyed to both recipients by 
using its bits to decide whether the next transmitted word should be from the 
American source text or from the Belarusian source text. Each recipient can 
concatenate the words that they understand in order to receive their personal 
message, and can also recover the binary string. 

An example of a broadcast channel consists of two binary symmetric
channels with a common input. The two halves of the channel have flip
probabilities $f_A$ and $f_B$. We’ll assume that A has the better half-channel, i.e.,
$f_A < f_B < 1/2$. [A closely related channel is a ‘degraded’ broadcast channel,
in which the conditional probabilities are such that the random variables have
the structure of a Markov chain,

\[
x \to y^{(A)} \to y^{(B)}, \qquad (15.4)
\]

i.e., $y^{(B)}$ is a further degraded version of $y^{(A)}$.] In this special case, it turns
out that whatever information is getting through to receiver B can also be
recovered by receiver A. So there is no point distinguishing between $R_0$ and
$R_B$: the task is to find the capacity region for the rate pair $(R_0, R_A)$, where
$R_0$ is the rate of information reaching both A and B, and $R_A$ is the rate of
the extra information reaching A.
The following exercise is equivalent to this one, and a solution to it is
illustrated in figure 15.8.


Exercise 15.20.!7! Variable-rate error-correcting codes for channels with unknown 
noise level. In real life, channels may sometimes not be well characterized 
before the encoder is installed. As a model of this situation, imagine that a 
channel is known to be a binary symmetric channel with noise level either $f_A$
or $f_B$. Let $f_B > f_A$, and let the two capacities be $C_A$ and $C_B$.

Those who like to live dangerously might install a system designed for noise
level $f_A$ with rate $R_A \simeq C_A$; in the event that the noise level turns out to be
$f_B$, our experience of Shannon’s theories would lead us to expect that there
would be a catastrophic failure to communicate information reliably (solid line
in figure 15.6).

Figure 15.5. Rates achievable by simple timesharing.

Figure 15.6. Rate of reliable communication R, as a function of noise level f,
for Shannonesque codes designed to operate at noise levels $f_A$ (solid line) and
$f_B$ (dashed line).

A conservative approach would design the encoding system for the worst-
case scenario, installing a code with rate $R_B \simeq C_B$ (dashed line in figure 15.6).
In the event that the lower noise level, $f_A$, holds true, the managers would
have a feeling of regret because of the wasted capacity difference $C_A - R_B$.

Is it possible to create a system that not only transmits reliably at some
rate $R_0$ whatever the noise level, but also communicates some extra, ‘lower-
priority’ bits if the noise level is low, as shown in figure 15.7? This code
communicates the high-priority bits reliably at all noise levels between $f_A$ and
$f_B$, and communicates the low-priority bits also if the noise level is $f_A$ or
below.

This problem is mathematically equivalent to the previous problem, the
degraded broadcast channel. The lower rate of communication was there called
$R_0$, and the rate at which the low-priority bits are communicated if the noise
level is low was called $R_A$.

An illustrative answer is shown in figure 15.8, for the case $f_A = 0.01$ and
$f_B = 0.1$. (This figure also shows the achievable region for a broadcast channel
whose two half-channels have noise levels $f_A = 0.01$ and $f_B = 0.1$.) I admit I
find the gap between the simple time-sharing solution and the cunning solution
disappointingly small.
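
The simple time-sharing benchmark for these two noise levels is easy to tabulate. The sketch below uses the standard capacity of a binary symmetric channel, $C = 1 - H_2(f)$, and the time-sharing rule $(R_A, R_B) = (\phi_A C_A, \phi_B C_B)$; the function names are illustrative, and the cunning solution is deliberately not computed here.

```python
from math import log2


def H2(q):
    """Binary entropy in bits."""
    return 0.0 if q in (0.0, 1.0) else q * log2(1 / q) + (1 - q) * log2(1 / (1 - q))


def bsc_capacity(f):
    """Capacity in bits per use of a binary symmetric channel with flip probability f."""
    return 1 - H2(f)


C_A, C_B = bsc_capacity(0.01), bsc_capacity(0.1)


def time_sharing(phi_A):
    """Rates (R_A, R_B) when a fraction phi_A of the time is devoted to channel A."""
    return phi_A * C_A, (1 - phi_A) * C_B


print(f"C_A = {C_A:.4f}, C_B = {C_B:.4f}")
for phi_A in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(phi_A, time_sharing(phi_A))
```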

In Chapter 50 we will discuss codes for a special class of broadcast channels, 
namely erasure channels, where every symbol is either received without error 
or erased. These codes have the nice property that they are rateless — the 
number of symbols transmitted is determined on the fly such that reliable 
communication is achieved, whatever the erasure statistics of the channel.


Exercise 15.21.[3] Multiterminal information networks are both important
practically and intriguing theoretically. Consider the following example of a two-way
binary channel (figure 15.9a,b): two people both wish to talk over the channel, 
and they both want to hear what the other person is saying; but you can hear 
the signal transmitted by the other person only if you are transmitting a zero. 
What simultaneous information rates from A to B and from B to A can be 
achieved, and how? Everyday examples of such networks include the VHF 
channels used by ships, and computer ethernet networks (in which all the 
devices are unable to hear anything if two or more devices are broadcasting 
simultaneously). 

Obviously, we can achieve rates of 1/2 in both directions by simple time- 
sharing. But can the two information rates be made larger? Finding the 
capacity of a general two-way channel is still an open problem. However, 
we can obtain interesting results concerning achievable points for the simple 
binary channel discussed above, as indicated in figure 15.9c. There exist codes 
that can achieve rates up to the boundary shown. There may exist better 
codes too. 


Solutions 


Solution to exercise 15.12 (p.235). C(Q) = 5 bits. 
Hint for the last part: a solution exists that involves a simple (8,5) code. 
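
A short worked version of the capacity value, offered as a sketch of one route to it: given the input, the output is one of 8 equally likely words, so $H(Y|X) = \log_2 8 = 3$ bits whatever the input distribution; and a uniform input distribution makes the output uniform over all $2^8$ words, so $H(Y)$ reaches its maximum of 8 bits. Hence

\[
C(Q) = \max_{P_X} \left[ H(Y) - H(Y|X) \right] = 8 - \log_2 8 = 5 \text{ bits}.
\]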

Figure 15.7. Rate of reliable communication R, as a function of noise level f,
for a desired variable-rate code.

Figure 15.8. An achievable region for the channel with unknown noise level.
Assuming the two possible noise levels are $f_A = 0.01$ and $f_B = 0.1$, the dashed
lines show the rates $R_A$, $R_B$ that are achievable using a simple time-sharing
approach, and the solid line shows rates achievable using a more cunning
approach.

Figure 15.9. (a) A general two-way channel. (b) The rules for a binary
two-way channel. The two tables show the outputs $y^{(A)}$ and $y^{(B)}$ that result
for each state of the inputs. (c) Achievable region for the two-way binary
channel. Rates below the solid line are achievable. The dotted line shows the
‘obviously achievable’ region which can be attained by simple time-sharing.


Message Passing


One of the themes of this book is the idea of doing complicated calculations 
using simple distributed hardware. It turns out that quite a few interesting 
problems can be solved by message-passing algorithms, in which simple mes- 
sages are passed locally among simple processors whose operations lead, after 
some time, to the solution of a global problem. 


> 16.1 Counting 


As an example, consider a line of soldiers walking in the mist. The commander 
wishes to perform the complex calculation of counting the number of soldiers 
in the line. This problem could be solved in two ways. 

First there is a solution that uses expensive hardware: the loud booming 
voices of the commander and his men. The commander could shout ‘all soldiers 
report back to me within one minute!’, then he could listen carefully as the 
men respond ‘Molesworth here sir!’, ‘Fotherington—Thomas here sir!’, and so 
on. This solution relies on several expensive pieces of hardware: there must be 
a reliable communication channel to and from every soldier; the commander 
must be able to listen to all the incoming messages — even when there are 
hundreds of soldiers — and must be able to count; and all the soldiers must be 
well-fed if they are to be able to shout back across the possibly-large distance 
separating them from the commander. 

The second way of finding this global function, the number of soldiers, 
does not require global communication hardware, high IQ, or good food; we 
simply require that each soldier can communicate single integers with the two 
adjacent soldiers in the line, and that the soldiers are capable of adding one 
to a number. Each soldier follows these rules: 





Algorithm 16.1. Message-passing rule-set A.

1. If you are the front soldier in the line, say the number ‘one’ to the
   soldier behind you.

2. If you are the rearmost soldier in the line, say the number ‘one’ to
   the soldier in front of you.

3. If a soldier ahead of or behind you says a number to you, add one
   to it, and say the new number to the soldier on the other side.





If the clever commander can not only add one to a number, but also add
two numbers together, then he can find the global number of soldiers by simply
adding together:

    the number said to him by the soldier in front of him
      (which equals the total number of soldiers in front),
  + the number said to the commander by the soldier behind him
      (which is the number behind),
  + one (to count the commander himself).





This solution requires only local communication hardware and simple compu- 
tations (storage and addition of integers). 
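
Rule-set A can be simulated in a few lines. In this sketch the soldiers are indexed from the front (index 0) to the rear, and a soldier who hears nothing from one side (because he is at an end of the line) is treated as having heard zero; that convention, and the name `count_line`, are assumptions made for the illustration.

```python
def count_line(num_soldiers, commander):
    """Simulate rule-set A and return the total deduced by the soldier at index `commander`."""
    heard_from_front = [0] * num_soldiers    # number each soldier hears from the soldier ahead
    heard_from_behind = [0] * num_soldiers   # number each soldier hears from the soldier behind

    # Messages travelling towards the rear: the front soldier says 'one',
    # everyone else says what he heard plus one.
    for i in range(1, num_soldiers):
        heard_from_front[i] = heard_from_front[i - 1] + 1

    # Messages travelling towards the front: the rearmost soldier says 'one'.
    for i in range(num_soldiers - 2, -1, -1):
        heard_from_behind[i] = heard_from_behind[i + 1] + 1

    # The commander adds the two numbers he heard, plus one for himself.
    return heard_from_front[commander] + heard_from_behind[commander] + 1


print(count_line(5, 3))  # three soldiers in front, one behind, plus the commander
```

Only local operations are used: each entry is obtained from a neighbour's entry by adding one.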




Separation 


This clever trick makes use of a profound property of the total number of 
soldiers: that it can be written as the sum of the number of soldiers in front 
of a point and the number behind that point, two quantities which can be 
computed separately, because the two groups are separated by the commander. 

If the soldiers were not arranged in a line but were travelling in a swarm, 
then it would not be easy to separate them into two groups in this way. The
guerillas in figure 16.3 could not be counted using the above message-passing
rule-set A, because, while the guerillas do have neighbours (shown by lines), 
it is not clear who is ‘in front’ and who is ‘behind’; furthermore, since the 
graph of connections between the guerillas contains cycles, it is not possible 
for a guerilla in a cycle (such as ‘Jim’) to separate the group into two groups, 
‘those in front’, and ‘those behind’. 

A swarm of guerillas can be counted by a modified message-passing algo- 
rithm if they are arranged in a graph that contains no cycles. 


Rule-set B is a message-passing algorithm for counting a swarm of guerillas 
whose connections form a cycle-free graph, also known as a tree, as illustrated 
in figure 16.4. Any guerilla can deduce the total in the tree from the messages 
that they receive. 

Figure 16.2. A line of soldiers counting themselves using message-passing
rule-set A. The commander can add ‘3’ from the soldier in front, ‘1’ from the
soldier behind, and ‘1’ for himself, and deduce that there are 5 soldiers in total.

Figure 16.3. A swarm of guerillas.

Figure 16.4. A swarm of guerillas whose connections form a tree.

Algorithm 16.5. Message-passing rule-set B.

1. Count your number of neighbours, N.

2. Keep count of the number of messages you have received from your
   neighbours, m, and of the values $v_1, v_2, \ldots, v_N$ of each of those
   messages. Let V be the running total of the messages you have
   received.

3. If the number of messages you have received, m, is equal to N − 1,
   then identify the neighbour who has not sent you a message and tell
   them the number V + 1.

4. If the number of messages you have received is equal to N, then:

   (a) the number V + 1 is the required total.

   (b) for each neighbour n {
           say to neighbour n the number $V + 1 - v_n$.
       }
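
Rule-set B can be simulated on any tree. The sketch below represents the tree as a dictionary mapping each guerilla to the set of its neighbours; this representation, the work queue, the rule that no neighbour is told a number twice, and the name `count_tree` are all choices made for the illustration rather than part of the algorithm as stated.

```python
from collections import deque


def count_tree(adj):
    """Run rule-set B on a tree {node: set of neighbours}; return each node's deduced total."""
    received = {u: {} for u in adj}   # received[u][w] = number that w said to u
    told = {u: set() for u in adj}    # neighbours that u has already spoken to
    totals = {}
    queue = deque(adj)                # nodes whose situation should be (re)examined

    while queue:
        u = queue.popleft()
        N, m = len(adj[u]), len(received[u])
        V = sum(received[u].values())
        if m == N - 1:
            # Rule 3: tell the one silent neighbour the number V + 1.
            (silent,) = adj[u] - set(received[u])
            outgoing = [(silent, V + 1)]
        elif m == N:
            # Rule 4: V + 1 is the total; tell each neighbour n the number V + 1 - v_n.
            totals[u] = V + 1
            outgoing = [(n, V + 1 - received[u][n]) for n in adj[u]]
        else:
            outgoing = []
        for n, value in outgoing:
            if n not in told[u]:      # each neighbour is told at most once
                told[u].add(n)
                received[n][u] = value
                queue.append(n)
    return totals


# A small illustrative tree with six guerillas.
tree = {0: {1, 2}, 1: {0, 3, 4}, 2: {0}, 3: {1}, 4: {1, 5}, 5: {4}}
print(count_tree(tree))
```

Every guerilla ends up with the same total, because the number sent along an edge is the number of guerillas on the sender's side of that edge.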


Figure 16.6. A triangular 41 x 41 grid. How many paths are there from A to B?
One path is shown.


> 16.2 Path-counting


A more profound task than counting squaddies is the task of counting the 
number of paths through a grid, and finding how many paths pass through 
any given point in the grid. 

Figure 16.6 shows a rectangular grid, and a path through the grid, con- 
necting points A and B. A valid path is one that starts from A and proceeds 
to B by rightward and downward moves. Our questions are: 


1. How many such paths are there from A to B? 


2. If a random path from A to B is selected, what is the probability that it 
passes through a particular node in the grid? [When we say ‘random’, we 
mean that all paths have exactly the same probability of being selected.] 


3. How can a random path from A to B be selected? 


Counting all the paths from A to B doesn’t seem straightforward. The number
of paths is expected to be pretty big: even if the permitted grid were a diagonal
strip only three nodes wide, there would still be about $2^{N/2}$ possible paths.

The computational breakthrough is to realize that to find the number of 
paths, we do not have to enumerate all the paths explicitly. Pick a point P in 
the grid and consider the number of paths from A to P. Every path from A 
to P must come in to P through one of its upstream neighbours (‘upstream’ 
meaning above or to the left). So the number of paths from A to P can be 
found by adding up the number of paths from A to each of those neighbours. 

This message-passing algorithm is illustrated in figure 16.8 for a simple 
grid with ten vertices connected by twelve directed edges. We start by send- 
ing the ‘1’ message from A. When any node has received messages from all its 
upstream neighbours, it sends the sum of them on to its downstream neigh- 
bours. At B, the number 5 emerges: we have counted the number of paths 
from A to B without enumerating them all. As a sanity-check, figure 16.9 
shows the five distinct paths from A to B. 
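
The forward pass can be sketched for a plain rectangular grid in which every node has at most two upstream neighbours (above and to the left). This is not the ten-vertex grid of figure 16.8, whose exact shape is not reproduced here; the 3 x 3 example and the name `count_paths` are illustrative.

```python
from math import comb


def count_paths(rows, cols):
    """Number of rightward/downward paths from the top-left node to every node of the grid."""
    paths = [[0] * cols for _ in range(rows)]
    paths[0][0] = 1  # the '1' message sent from A
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                paths[r][c] += paths[r - 1][c]   # message from the neighbour above
            if c > 0:
                paths[r][c] += paths[r][c - 1]   # message from the neighbour to the left
    return paths


grid = count_paths(3, 3)
for row in grid:
    print(row)
# For a full rectangle the answer at B has a closed form, which makes a handy check.
print(grid[-1][-1] == comb(4, 2))
```

Each node is visited once and does one or two additions, so the cost grows with the number of nodes rather than with the number of paths.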

Having counted all paths, we can now move on to more challenging prob- 
lems: computing the probability that a random path goes through a given 
vertex, and creating a random path. 


Probability of passing through a node 


By making a backward pass as well as the forward pass, we can deduce how 
many of the paths go through each node; and if we divide that by the total 
number of paths, we obtain the probability that a randomly selected path 
passes through that node. Figure 16.10 shows the backward-passing messages 
in the lower-right corners of the tables, and the original forward-passing mes- 
sages in the upper-left corners. By multiplying these two numbers at a given 
vertex, we find the total number of paths passing through that vertex. For 
example, four paths pass through the central vertex. 

Figure 16.11 shows the result of this computation for the triangular 41 x 
41 grid. The area of each blob is proportional to the probability of passing 
through the corresponding node. 
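
The forward-backward computation can be sketched in the same way, again for an illustrative rectangular grid rather than the grids of the figures. Exact fractions are used so that the probabilities can be compared without rounding; `through_probabilities` is a name chosen for this sketch.

```python
from fractions import Fraction


def through_probabilities(rows, cols):
    """Probability that a uniformly chosen right/down path passes through each node."""
    fwd = [[0] * cols for _ in range(rows)]   # number of paths from A to each node
    bwd = [[0] * cols for _ in range(rows)]   # number of paths from each node to B
    for r in range(rows):
        for c in range(cols):
            if r == 0 and c == 0:
                fwd[r][c] = 1
            else:
                fwd[r][c] = (fwd[r - 1][c] if r > 0 else 0) + (fwd[r][c - 1] if c > 0 else 0)
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if r == rows - 1 and c == cols - 1:
                bwd[r][c] = 1
            else:
                bwd[r][c] = ((bwd[r + 1][c] if r < rows - 1 else 0)
                             + (bwd[r][c + 1] if c < cols - 1 else 0))
    total = fwd[-1][-1]
    # Paths through a node = (paths into it) x (paths out of it).
    return [[Fraction(fwd[r][c] * bwd[r][c], total) for c in range(cols)]
            for r in range(rows)]


for row in through_probabilities(3, 3):
    print([str(p) for p in row])
```

A useful sanity check is that the probabilities along any anti-diagonal sum to one, since every path crosses each anti-diagonal exactly once.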


Random path sampling 


Exercise 16.1.[1, p.247] If one creates a ‘random’ path from A to B by flipping
a fair coin at every junction where there is a choice of two directions, is
the resulting path a uniform random sample from the set of all paths?
[Hint: imagine trying it for the grid of figure 16.8.]

Figure 16.7. Every path from A to P enters P through an upstream neighbour
of P, either M or N; so we can find the number of paths from A to P by adding
the number of paths from A to M to the number from A to N.

Figure 16.8. Messages sent in the forward pass.

Figure 16.9. The five paths.

Figure 16.10. Messages sent in the forward and backward passes.


There is a neat insight to be had here, and I’d like you to have the satisfaction 
of figuring it out. 


> Exercise 16.2.[2, p.247] Having run the forward and backward algorithms
between points A and B on a grid, how can one draw one path from A to
B uniformly at random? (Figure 16.11.)

Figure 16.11. (a) The probability of passing through each node, and
(b) a randomly chosen path.

The message-passing algorithm we used to count the paths to B is an 
example of the sum-product algorithm. The ‘sum’ takes place at each node 
when it adds together the messages coming from its predecessors; the ‘product’ 
was not mentioned, but you can think of the sum as a weighted sum in which 
all the summed terms happened to have weight 1. 


> 16.3 Finding the lowest-cost path 


Imagine you wish to travel as quickly as possible from Ambridge (A) to Bognor
(B). The various possible routes are shown in figure 16.12, along with the cost
in hours of traversing each edge in the graph. For example, the route A-I-L-
N-B has a cost of 8 hours. We would like to find the lowest-cost path without
explicitly evaluating the cost of all paths. We can do this efficiently by finding
for each node what the cost of the lowest-cost path to that node from A is.
These quantities can be computed by message-passing, starting from node A.
The message-passing algorithm is called the min-sum algorithm or Viterbi
algorithm.

Figure 16.12. Route diagram from Ambridge to Bognor, showing the costs
associated with the edges.

For brevity, we’ll call the cost of the lowest-cost path from node A to
node x ‘the cost of x’. Each node can broadcast its cost to its descendants
once it knows the costs of all its possible predecessors. Let’s step through the
algorithm by hand. The cost of A is zero. We pass this news on to H and I.

As the message passes along each edge in the graph, the cost of that edge is 
added. We find the costs of H and I are 4 and 1 respectively (figure 16.13a). 
Similarly then, the costs of J and L are found to be 6 and 2 respectively, but 
what about K? Out of the edge H-K comes the message that a path of cost 5 
exists from A to K via H; and from edge I-K we learn of an alternative path 
of cost 3 (figure 16.13b). The min-sum algorithm sets the cost of K equal 
to the minimum of these (the ‘min’), and records which was the smallest-cost 
route into K by retaining only the edge I-K and pruning away the other edges 
leading to K (figure 16.13c). Figures 16.13d and e show the remaining two 
iterations of the algorithm which reveal that there is a path from A to B with 
cost 6. [If the min-sum algorithm encounters a tie, where the minimum-cost 
path to a node is achieved by more than one route to it, then the algorithm
can pick any of those routes at random.] 

We can recover this lowest-cost path by backtracking from B, following
the trail of surviving edges back to A. We deduce that the lowest-cost path is
A-I-K-M-B.
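
The hand calculation can be mirrored in a short min-sum sketch. The costs of edges A-H, A-I, H-J, H-K, I-K and I-L follow from the walkthrough in the text; the remaining edge costs (J-M, K-M, K-N, L-N, M-B, N-B) are assumptions chosen only to be consistent with the totals quoted in the text (route A-I-L-N-B costs 8, and the lowest-cost route A-I-K-M-B costs 6), since figure 16.12 is not reproduced here. The function names are illustrative.

```python
# Directed edges and their costs in hours. The last six entries are assumed values.
edges = {
    ("A", "H"): 4, ("A", "I"): 1,
    ("H", "J"): 2, ("H", "K"): 1, ("I", "K"): 2, ("I", "L"): 1,
    ("J", "M"): 2, ("K", "M"): 2, ("K", "N"): 1, ("L", "N"): 3,   # assumed
    ("M", "B"): 1, ("N", "B"): 3,                                 # assumed
}
order = "AHIJKLMNB"  # every node appears after all of its predecessors


def min_sum(edges, order, start):
    """Return the lowest cost of reaching each node and the surviving edge into each node."""
    cost = {start: 0}
    survivor = {}
    for node in order:
        # By the time a node is visited, its own cost is final, so it can broadcast it.
        for (u, v), w in edges.items():
            if u == node:
                candidate = cost[u] + w          # the 'sum'
                if v not in cost or candidate < cost[v]:
                    cost[v] = candidate          # the 'min'
                    survivor[v] = u              # keep only the best edge into v
    return cost, survivor


def backtrack(survivor, start, goal):
    """Follow surviving edges from the goal back to the start."""
    path = [goal]
    while path[-1] != start:
        path.append(survivor[path[-1]])
    return path[::-1]


cost, survivor = min_sum(edges, order, "A")
print(cost["B"], "-".join(backtrack(survivor, "A", "B")))
```

Ties are broken here by keeping the first minimum found, which is one of the arbitrary choices the text allows.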

Other applications of the min-sum algorithm

Imagine that you manage the production of a product from raw materials
via a large set of operations. You wish to identify the critical path in your
process, that is, the subset of operations that are holding up production. If
any operations on the critical path were carried out a little faster then the
time to get from raw materials to product would be reduced.

The critical path of a set of operations can be found using the min-sum
algorithm.

In Chapter 25 the min-sum algorithm will be used in the decoding of
error-correcting codes.

> 16.4 Summary and related ideas

Some global functions have a separability property. For example, the number
of paths from A to P separates into the sum of the number of paths from A to M
(the point to P’s left) and the number of paths from A to N (the point above
P). Such functions can be computed efficiently by message-passing. Other
functions do not have such separability properties, for example

1. the number of pairs of soldiers in a troop who share the same birthday;

2. the size of the largest group of soldiers who share a common height
   (rounded to the nearest centimetre);

3. the length of the shortest tour that a travelling salesman could take that
   visits every soldier in a troop.

One of the challenges of machine learning is to find low-cost solutions to
problems like these. The problem of finding a large subset of variables that are
approximately equal can be solved with a neural network approach (Hopfield
and Brody, 2000; Hopfield and Brody, 2001). A neural approach to the
travelling salesman problem will be discussed in section 42.9.

Figure 16.13. Min-sum message-passing algorithm to find the cost of getting
to each node, and thence the lowest cost route from A to B.


> 16.5 Further exercises


> Exercise 16.3.[2] Describe the asymptotic properties of the probabilities
depicted in figure 16.11a, for a grid in a triangle of width and height N.


> Exercise 16.4. In image processing, the integral image $I(x, y)$ obtained from
an image $f(x, y)$ (where $x$ and $y$ are pixel coordinates) is defined by

$$ I(x, y) = \sum_{u=0}^{x} \sum_{v=0}^{y} f(u, v). \tag{16.1} $$

Show that the integral image $I(x, y)$ can be efficiently computed by
message passing.

Show that, from the integral image, some simple functions of the image
can be obtained. For example, give an expression for the sum of the
image intensities $f(x, y)$ for all $(x, y)$ in a rectangular region extending
from $(x_1, y_1)$ to $(x_2, y_2)$.
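
One possible way to explore this exercise numerically is sketched below. It is an illustration rather than the book's solution: the function names `integral_image` and `rectangle_sum` are hypothetical, the image is assumed to be a list of rows indexed as `f[x][y]`, and the rectangle is assumed to include both corner pixels. Each entry of the integral image is built from three neighbouring entries that have already been computed, which is the message-passing idea.

```python
def integral_image(f):
    """Return I with I[x][y] = sum of f[u][v] over 0 <= u <= x, 0 <= v <= y."""
    rows, cols = len(f), len(f[0])
    I = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            # Messages from the three neighbours that are already finished.
            from_x = I[x - 1][y] if x > 0 else 0
            from_y = I[x][y - 1] if y > 0 else 0
            corner = I[x - 1][y - 1] if x > 0 and y > 0 else 0
            # The corner block is contained in both neighbours, so subtract it once.
            I[x][y] = f[x][y] + from_x + from_y - corner
    return I


def rectangle_sum(I, x1, y1, x2, y2):
    """Sum of f over x1 <= x <= x2 and y1 <= y <= y2 (corners included)."""
    total = I[x2][y2]
    if x1 > 0:
        total -= I[x1 - 1][y2]
    if y1 > 0:
        total -= I[x2][y1 - 1]
    if x1 > 0 and y1 > 0:
        total += I[x1 - 1][y1 - 1]
    return total


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
I = integral_image(image)
```

Once `I` has been computed, any rectangle sum costs at most four look-ups, whatever the size of the rectangle.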
> 16.6 Solutions 


Solution to exercise 16.1 (p.244). Since there are five paths through the grid 
of figure 16.8, they must all have probability 1/5. But a strategy based on fair 
coin-flips will produce paths whose probabilities are powers of 1/2. 


Solution to exercise 16.2 (p.245). To make a uniform random walk, each for- 
ward step of the walk should be chosen using a different biased coin at each 
junction, with the biases chosen in proportion to the backward messages ema- 
nating from the two options. For example, at the first choice after leaving A, 
there is a ‘3’ message coming from the East, and a ‘2’ coming from South, so 
one should go East with probability 3/5 and South with probability 2/5. This 
is how the path in figure 16.11b was generated. 
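
The recipe in this solution can be sketched in code. The sketch assumes a hypothetical full rectangular grid in which every step goes East or South (the grid in the figures may have a different shape), and the function names are illustrative. Backward messages count the paths from each node to the destination, and each step is chosen with probability proportional to the message arriving from that direction.

```python
import random
from fractions import Fraction


def backward_counts(width, height):
    """counts[x][y] = number of East/South paths from node (x, y) to (width, height)."""
    counts = [[0] * (height + 1) for _ in range(width + 1)]
    for x in range(width, -1, -1):
        for y in range(height, -1, -1):
            if x == width and y == height:
                counts[x][y] = 1
            else:
                east = counts[x + 1][y] if x < width else 0
                south = counts[x][y + 1] if y < height else 0
                counts[x][y] = east + south
    return counts


def sample_path(width, height, rng=random):
    """Draw one path, choosing each step in proportion to the backward messages."""
    counts = backward_counts(width, height)
    x = y = 0
    steps = []
    while (x, y) != (width, height):
        east = counts[x + 1][y] if x < width else 0
        if rng.random() * counts[x][y] < east:
            steps.append("E")
            x += 1
        else:
            steps.append("S")
            y += 1
    return "".join(steps)


def path_probability(path, width, height):
    """Exact probability that sample_path produces the given path."""
    counts = backward_counts(width, height)
    x = y = 0
    probability = Fraction(1)
    for step in path:
        here = counts[x][y]
        if step == "E":
            x += 1
        else:
            y += 1
        probability *= Fraction(counts[x][y], here)
    return probability
```

The product of the step probabilities telescopes to one over the total number of paths, so every path is equally likely.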
17


Communication over Constrained Noiseless Channels 


In this chapter we study the task of communicating efficiently over a con- 
strained noiseless channel — a constrained channel over which not all strings 
from the input alphabet may be transmitted. 

We make use of the idea introduced in Chapter 16, that global properties 
of graphs can be computed by a local message-passing algorithm. 


> 17.1 Three examples of constrained binary channels 


A constrained channel can be defined by rules that define which strings are 
permitted. 


Example 17.1. In Channel A every 1 must be followed by at least one 0. 


A valid string for this channel is 


00100101001010100010. (17.1) 


As a motivation for this model, consider a channel in which 1s are repre- 
sented by pulses of electromagnetic energy, and the device that produces 
those pulses requires a recovery time of one clock cycle after generating 
a pulse before it can generate another. 


Example 17.2. Channel B has the rule that all 1s must come in groups of two 
or more, and all 0s must come in groups of two or more. 


A valid string for this channel is 


00111001110011000011. (17.2) 


As a motivation for this model, consider a disk drive in which succes- 
sive bits are written onto neighbouring points in a track along the disk 
surface; the values 0 and 1 are represented by two opposite magnetic 
orientations. The strings 101 and 010 are forbidden because a single 
isolated magnetic domain surrounded by domains having the opposite 
orientation is unstable, so that 101 might turn into 111, for example. 


Example 17.3. Channel C has the rule that the largest permitted runlength is 
two, that is, each symbol can be repeated at most once. 


A valid string for this channel is 


10010011011001101001. (17.3) 
A physical motivation for this model is a disk drive in which the rate of
rotation of the disk is not known accurately, so it is difficult to distinguish
between a string of two 1s and a string of three 1s, which are represented
by oriented magnetizations of duration $2\tau$ and $3\tau$ respectively, where
$\tau$ is the (poorly known) time taken for one bit to pass by; to avoid
the possibility of confusion, and the resulting loss of synchronization of
sender and receiver, we forbid the string of three 1s and the string of
three 0s.

Channel A: the substring 11 is forbidden.
Channel B: 101 and 010 are forbidden.
Channel C: 111 and 000 are forbidden.


All three of these channels are examples of runlength-limited channels.
The rules constrain the minimum and maximum numbers of successive 1s and
0s.

Channel          Runlength of 1s         Runlength of 0s
                 minimum   maximum       minimum   maximum
unconstrained       1         ∞             1         ∞
A                   1         1             1         ∞
B                   2         ∞             2         ∞
C                   1         2             1         2

In channel A, runs of 0s may be of any length but runs of 1s are restricted to
length one. In channel B all runs must be of length two or more. In channel
C, all runs must be of length one or two.

The capacity of the unconstrained binary channel is one bit per channel 
use. What are the capacities of the three constrained channels? [To be fair, 
we haven’t defined the ‘capacity’ of such channels yet; please understand ‘ca- 
pacity’ as meaning how many bits can be conveyed reliably per channel-use.] 


Some codes for a constrained channel 


Let us concentrate for a moment on channel A, in which runs of 0s may be
of any length but runs of 1s are restricted to length one. We would like to
communicate a random binary file over this channel as efficiently as possible.

A simple starting point is a (2,1) code that maps each source bit into two
transmitted bits, C1. This is a rate-1/2 code, and it respects the constraints of
channel A, so the capacity of channel A is at least 0.5. Can we do better?

Code C1
  s    t
  0    00
  1    10

C1 is redundant because if the first of two received bits is a zero, we know
that the second bit will also be a zero. We can achieve a smaller average
transmitted length using a code that omits the redundant zeroes in C1.

Code C2
  s    t
  0    0
  1    10

C2 is such a variable-length code. If the source symbols are used with
equal frequency then the average transmitted length per source bit is

$$ L = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 2 = \frac{3}{2}, \tag{17.4} $$

so the average communication rate is

$$ R = \frac{2}{3}, \tag{17.5} $$

and the capacity of channel A must be at least 2/3.

Can we do better than C2? There are two ways to argue that the
information rate could be increased above R = 2/3.

The first argument assumes we are comfortable with the entropy as a
measure of information content. The idea is that, starting from code C2, we
can reduce the average message length, without greatly reducing the entropy
of the message we send, by decreasing the fraction of 1s that we transmit.
Imagine feeding into C2 a stream of bits in which the frequency of 1s is f. [Such
a stream could be obtained from an arbitrary binary file by passing the source
file into the decoder of an arithmetic code that is optimal for compressing
binary strings of density f.] The information rate R achieved is the entropy
of the source, $H_2(f)$, divided by the mean transmitted length,

$$ L(f) = (1 - f) + 2f = 1 + f. \tag{17.6} $$

Thus

$$ R(f) = \frac{H_2(f)}{L(f)} = \frac{H_2(f)}{1 + f}. \tag{17.7} $$

The original code C2, without preprocessor, corresponds to f = 1/2. What
happens if we perturb f a little towards smaller f, setting

$$ f = \frac{1}{2} + \delta, \tag{17.8} $$

for small negative $\delta$? In the vicinity of f = 1/2, the denominator L(f) varies
linearly with $\delta$. In contrast, the numerator $H_2(f)$ only has a second-order
dependence on $\delta$.

> Exercise 17.4.[1] Find, to order $\delta^2$, the Taylor expansion of $H_2(f)$ as a function
of $\delta$.

To first order, R(f) increases linearly with decreasing $\delta$. It must be possible
to increase R by decreasing f. Figure 17.1 shows these functions; R(f) does
indeed increase as f decreases and has a maximum of about 0.69 bits per
channel use at $f \simeq 0.38$.

Figure 17.1. Top: The information content per source symbol and mean
transmitted length per source symbol as a function of the source density.
Bottom: The information content per transmitted symbol, in bits, as a
function of f.

By this argument we have shown that the capacity of channel A is at least
$\max_f R(f) = 0.69$.
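
The maximum quoted above can be checked with a few lines of code. This is a minimal sketch: it scans a grid of densities, and the function names and grid resolution are illustrative choices rather than anything prescribed by the text.

```python
import math


def binary_entropy(f):
    """H_2(f) in bits."""
    if f <= 0.0 or f >= 1.0:
        return 0.0
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)


def rate(f):
    """R(f) = H_2(f) / (1 + f): bits conveyed per transmitted symbol by code C2."""
    return binary_entropy(f) / (1.0 + f)


def maximise_rate(steps=100000):
    """Scan f over a regular grid in (0, 1) and return the best (f, R(f))."""
    best_f = max((i / steps for i in range(1, steps)), key=rate)
    return best_f, rate(best_f)


f_best, r_best = maximise_rate()
print(f"f = {f_best:.3f}, R = {r_best:.3f}, R(1/2) = {rate(0.5):.3f}")
```

The scan reproduces the starting rate R(1/2) = 2/3 and locates the peak near f = 0.38.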


> Exercise 17.5.[1, p.257] If a file containing a fraction f = 0.5 1s is transmitted
by C2, what fraction of the transmitted stream is 1s?


What fraction of the transmitted bits is 1s if we drive code C2 with a
sparse source of density f = 0.38?


A second, more fundamental approach counts how many valid sequences
of length N there are, $S_N$. We can communicate $\log S_N$ bits in N channel
cycles by giving one name to each of these valid sequences.


> 17.2 The capacity of a constrained noiseless channel 


We defined the capacity of a noisy channel in terms of the mutual information
between its input and its output, then we proved that this number, the
capacity, was related to the number of distinguishable messages S(N) that could be
reliably conveyed over the channel in N uses of the channel by

$$ C = \lim_{N \to \infty} \frac{1}{N} \log S(N). \tag{17.9} $$

In the case of the constrained noiseless channel, we can adopt this identity as
our definition of the channel’s capacity. However, the name s, which, when
we were making codes for noisy channels (section 9.6), ran over messages
s = 1, ..., S, is about to take on a new role: labelling the states of our channel;
so in this chapter we will denote the number of distinguishable messages of
length N by $M_N$, and define the capacity to be:

$$ C = \lim_{N \to \infty} \frac{1}{N} \log M_N. \tag{17.10} $$


Once we have figured out the capacity of a channel we will return to the 
task of making a practical code for that channel. 


> 17.3 Counting the number of possible messages 


First let us introduce some representations of constrained channels. In a state 
diagram, states of the transmitter are represented by circles labelled with the 
name of the state. Directed edges from one state to another indicate that 
the transmitter is permitted to move from the first state to the second, and a 
label on that edge indicates the symbol emitted when that transition is made. 
Figure 17.2a shows the state diagram for channel A. It has two states, 0 and 
1. When transitions to state 0 are made, a 0 is transmitted; when transitions 
to state 1 are made, a 1 is transmitted; transitions from state 1 to state 1 are 
not possible. 

We can also represent the state diagram by a trellis section, which shows
two successive states in time at two successive horizontal locations
(figure 17.2b). The state of the transmitter at time n is called $s_n$. The set of
possible state sequences can be represented by a trellis as shown in figure 17.2c.
A valid sequence corresponds to a path through the trellis, and the number of
valid sequences is the number of paths. For the purpose of counting how many
paths there are through the trellis, we can ignore the labels on the edges and
summarize the trellis section by the connection matrix A, in which $A_{s's} = 1$
if there is an edge from state s to s', and $A_{s's} = 0$ otherwise (figure 17.2d).
Figure 17.3 shows the state diagrams, trellis sections and connection matrices
for channels B and C.

Figure 17.4. Counting the number of paths in the trellis of channel A. The
counts next to the nodes are accumulated by passing from left to right across
the trellises.

Figure 17.5. Counting the number of paths in the trellises of channels A, B,
and C. We assume that at the start the first bit is preceded by 00, so that for
channels A and B, any initial character is permitted, but for channel C, the
first character must be a 1. The totals $M_1, \ldots, M_8$ shown above the trellises are:
(a) Channel A: 2, 3, 5, 8, 13, 21, 34, 55
(b) Channel B: 2, 3, 5, 8, 13, 21, 34, 55
(c) Channel C: 1, 2, 3, 5, 8, 13, 21, 34

  n     M_n        M_n/M_{n-1}   log2 M_n   (1/n) log2 M_n
  1     2                          1.0        1.00
  2     3          1.500           1.6        0.79
  3     5          1.667           2.3        0.77
  4     8          1.600           3.0        0.75
  5     13         1.625           3.7        0.74
  6     21         1.615           4.4        0.73
  7     34         1.619           5.1        0.73
  8     55         1.618           5.8        0.72
  9     89         1.618           6.5        0.72
  10    144        1.618           7.2        0.72
  11    233        1.618           7.9        0.71
  12    377        1.618           8.6        0.71
  100   9×10^20    1.618          69.7        0.70
  200   7×10^41    1.618         139.1        0.70
  300   6×10^62    1.618         208.5        0.70
  400   5×10^83    1.618         277.9        0.69

Let’s count the number of paths for channel A by message-passing in its 
trellis. Figure 17.4 shows the first few steps of this counting process, and 
figure 17.5a shows the number of paths ending in each state after n steps for 
n=1,...,8. The total number of paths of length n, Mn, is shown along the 
top. We recognize Mn as the Fibonacci series. 
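
The counting process can be written out as a short message-passing loop. In this sketch the trellis is described by a dictionary of permitted successor states, which is an illustrative encoding rather than the book's notation. The walk is assumed to start as if the previous symbol were a 0, so that either symbol may be sent first.

```python
def count_paths(successors, start_counts, steps):
    """Pass path counts left to right through the trellis.

    successors[s] lists the states reachable from state s in one step.
    Returns the total number of paths after 1, 2, ..., steps transitions.
    """
    counts = dict(start_counts)
    totals = []
    for _ in range(steps):
        new_counts = {state: 0 for state in successors}
        for state, count in counts.items():
            for nxt in successors[state]:
                # Every path ending in `state` extends to a path ending in `nxt`.
                new_counts[nxt] += count
        counts = new_counts
        totals.append(sum(counts.values()))
    return totals


# Channel A: after a 0 we may send 0 or 1; after a 1 we must send 0.
channel_a = {"0": ["0", "1"], "1": ["0"]}
totals = count_paths(channel_a, {"0": 1, "1": 0}, 8)
print(totals)
```

Each step costs a number of additions equal to the number of edges in one trellis section, so the cost of counting grows only linearly with the sequence length.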


Exercise 17.6.[1] Show that the ratio of successive terms in the Fibonacci series
tends to the golden ratio,

$$ \gamma \equiv \frac{1 + \sqrt{5}}{2} = 1.618. \tag{17.11} $$

Thus, to within a constant factor, $M_N$ scales as $M_N \sim \gamma^N$ as $N \to \infty$, so the
capacity of channel A is

$$ C = \lim \frac{1}{N} \log_2 \left[ \text{constant} \cdot \gamma^N \right] = \log_2 \gamma = \log_2 1.618 = 0.694. \tag{17.12} $$


How can we describe what we just did? The count of the number of paths
is a vector $\mathbf{c}^{(n)}$; we can obtain $\mathbf{c}^{(n+1)}$ from $\mathbf{c}^{(n)}$ using:

$$ \mathbf{c}^{(n+1)} = \mathbf{A} \mathbf{c}^{(n)}. \tag{17.13} $$

So

$$ \mathbf{c}^{(N)} = \mathbf{A}^N \mathbf{c}^{(0)}, \tag{17.14} $$

where $\mathbf{c}^{(0)}$ is the state count before any symbols are transmitted. In figure 17.5
we assumed $\mathbf{c}^{(0)} = [0, 1]^{\mathsf{T}}$, i.e., that either of the two symbols is permitted at
the outset. The total number of paths is $M_n = \sum_s c_s^{(n)} = \mathbf{c}^{(n)} \cdot \mathbf{n}$. In the limit,
$\mathbf{c}^{(N)}$ becomes dominated by the principal right-eigenvector of A.

$$ \mathbf{c}^{(N)} \to \text{constant} \cdot \lambda_1^N \mathbf{e}^{(R)}. \tag{17.15} $$

Here, $\lambda_1$ is the principal eigenvalue of A.

Figure 17.6. Counting the number of paths in the trellis of channel A.


So to find the capacity of any constrained channel, all we need to do is find
the principal eigenvalue, $\lambda_1$, of its connection matrix. Then

$$ C = \log_2 \lambda_1. \tag{17.16} $$
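
As an illustration of this recipe, the principal eigenvalue can be estimated by repeatedly applying the connection matrix to a vector of counts and renormalizing. The sketch below is one way to do it, not the only one. The four-state descriptions of channels B and C are assumptions of mine, since the state diagrams of figure 17.3 are not reproduced in the text: state "0" means a run of one 0, "00" a run of two or more 0s, and similarly for 1s.

```python
import math


def principal_eigenvalue(successors, iterations=200):
    """Estimate the largest eigenvalue of the connection matrix by power iteration.

    successors[s] lists the states reachable from state s in one step.
    """
    weights = {s: 1.0 for s in successors}
    growth = 1.0
    for _ in range(iterations):
        new = {s: 0.0 for s in successors}
        for s, w in weights.items():
            for nxt in successors[s]:
                new[nxt] += w
        # Once the weights sum to one, this sum is the growth factor per step.
        growth = sum(new.values())
        weights = {s: w / growth for s, w in new.items()}
    return growth


# Channel A: 11 forbidden.
channel_a = {"0": ["0", "1"], "1": ["0"]}
# Channel B (assumed state labels): every run has length two or more.
channel_b = {"0": ["00"], "00": ["00", "1"], "1": ["11"], "11": ["11", "0"]}
# Channel C (assumed state labels): every run has length one or two.
channel_c = {"0": ["00", "1"], "00": ["1"], "1": ["11", "0"], "11": ["0"]}

capacities = {
    name: math.log2(principal_eigenvalue(channel))
    for name, channel in [("A", channel_a), ("B", channel_b), ("C", channel_c)]
}
print(capacities)
```

Under these assumptions all three channels give the same growth factor, the golden ratio, and hence the same capacity of about 0.694 bits per channel use.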





> 17.4 Back to our model channels 


Comparing figure 17.5a and figures 17.5b and c it looks as if channels B and
C have the same capacity as channel A. The principal eigenvalues of the three
trellises are the same (the eigenvectors for channels A and B are given at the
bottom of table C.4, p.608). And indeed the channels are intimately related.

Figure 17.7. An accumulator and a differentiator.

Equivalence of channels A and B 


If we take any valid string s for channel A and pass it through an accumulator,
obtaining t defined by:

$$ \begin{aligned} t_1 &= s_1 \\ t_n &= t_{n-1} + s_n \bmod 2 \quad \text{for } n \geq 2, \end{aligned} \tag{17.17} $$

then the resulting string is a valid string for channel B, because there are no
11s in s, so there are no isolated digits in t. The accumulator is an invertible
operator, so, similarly, any valid string t for channel B can be mapped onto a
valid string s for channel A through the binary differentiator,

$$ \begin{aligned} s_1 &= t_1 \\ s_n &= t_n - t_{n-1} \bmod 2 \quad \text{for } n \geq 2. \end{aligned} \tag{17.18} $$

Because + and − are equivalent in modulo 2 arithmetic, the differentiator is
also a blurrer, convolving the source stream with the filter (1, 1).
Channel C is also intimately related to channels A and B. 
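
The accumulator and differentiator are easy to try out. The sketch below represents bit strings as Python strings of '0' and '1'; the function names are illustrative.

```python
def accumulate(s):
    """Accumulator: t_1 = s_1, t_n = (t_{n-1} + s_n) mod 2."""
    t = []
    running = 0
    for bit in s:
        running = (running + int(bit)) % 2
        t.append(str(running))
    return "".join(t)


def differentiate(t):
    """Differentiator: s_1 = t_1, s_n = (t_n - t_{n-1}) mod 2."""
    s = []
    previous = 0
    for bit in t:
        s.append(str((int(bit) - previous) % 2))
        previous = int(bit)
    return "".join(s)


valid_for_a = "00100101001010100010"   # the example string (17.1)
mapped_to_b = accumulate(valid_for_a)
print(mapped_to_b)
print(differentiate(mapped_to_b) == valid_for_a)
```

Applied to the example string (17.1) for channel A, the accumulator produces the example string (17.2) for channel B, and the differentiator recovers the original.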


> Exercise 17.7.[p.257] What is the relationship of channel C to channels A
and B?


> 17.5 Practical communication over constrained channels

OK, how to do it in practice? Since all three channels are equivalent, we can
concentrate on channel A.


Fixed-length solutions

We start with explicitly-enumerated codes. The code in the table 17.8 achieves
a rate of 3/5 = 0.6.

  s    c(s)
  1    00000
  2    10000
  3    01000
  4    00100
  5    00010
  6    10100
  7    01010
  8    10010

Table 17.8. A runlength-limited code for channel A.


> Exercise 17.8.[p.257] Similarly, enumerate all strings of length 8 that end in
the zero state. (There are 34 of them.) Hence show that we can map 5
bits (32 source strings) to 8 transmitted bits and achieve rate 5/8 = 0.625.


What rate can be achieved by mapping an integer number of source bits 
to N = 16 transmitted bits? 
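
For small block lengths the enumeration can be done by brute force. The sketch below lists the strings of length n that contain no 11 and end in 0, so that codewords can be concatenated without creating a forbidden 11 at the join, and reports the largest whole number of source bits they can carry. The function names are illustrative.

```python
from itertools import product


def codewords(n):
    """All binary strings of length n with no substring 11 that end in 0."""
    words = []
    for bits in product("01", repeat=n):
        word = "".join(bits)
        if "11" not in word and word.endswith("0"):
            words.append(word)
    return words


def best_fixed_length_rate(n):
    """Largest integer number of source bits k with 2**k <= number of codewords."""
    available = len(codewords(n))
    k = available.bit_length() - 1
    return k, k / n


for n in (5, 8, 16):
    k, rate = best_fixed_length_rate(n)
    print(n, len(codewords(n)), k, rate)
```

Brute force examines all 2^n strings, which is fine for n = 16 but not for long blocks; the trellis count of section 17.3 gives the same numbers with far less work.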


Optimal variable-length solution 


The optimal way to convey information over the constrained channel is to find
the optimal transition probabilities for all points in the trellis, $Q_{s'|s}$, and make
transitions with these probabilities.

When discussing channel A, we showed that a sparse source with density 
f = 0.38, driving code C2, would achieve capacity. And we know how to 
make sparsifiers (Chapter 6): we design an arithmetic code that is optimal 
for compressing a sparse source; then its associated decoder gives an optimal 
mapping from dense (i.e., random binary) strings to sparse strings. 

The task of finding the optimal probabilities is given as an exercise. 


Exercise 17.9. Show that the optimal transition probabilities Q can be found
as follows.

Find the principal right- and left-eigenvectors of A, that is the solutions
of $\mathbf{A} \mathbf{e}^{(R)} = \lambda \mathbf{e}^{(R)}$ and $\mathbf{e}^{(L)\mathsf{T}} \mathbf{A} = \lambda \mathbf{e}^{(L)\mathsf{T}}$ with largest eigenvalue $\lambda$. Then
construct a matrix Q whose invariant distribution is proportional to
$e_i^{(R)} e_i^{(L)}$, namely

$$ Q_{s'|s} = \frac{e^{(L)}_{s'} A_{s's}}{\lambda \, e^{(L)}_{s}}. \tag{17.19} $$

[Hint: exercise 16.2 (p.245) might give helpful cross-fertilization here.]
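
Equation (17.19) can be evaluated numerically without a linear-algebra library. The following sketch finds the left eigenvector by power iteration and then forms Q. The dictionary encoding of the trellis and the function name are illustrative assumptions, and the code does not prove the optimality that the exercise asks for.

```python
def optimal_transitions(successors, iterations=500):
    """Return (lambda, Q), where Q[s][s2] is the probability of moving from s to s2.

    Implements Q_{s'|s} = e_L[s'] A_{s's} / (lambda e_L[s]), where e_L is the
    principal left eigenvector of the connection matrix.
    """
    e_left = {s: 1.0 for s in successors}
    lam = 1.0
    for _ in range(iterations):
        # (e_L A)_s is the sum of e_L over the states reachable from s.
        new = {s: sum(e_left[nxt] for nxt in successors[s]) for s in successors}
        lam = sum(new.values())           # e_left sums to one after the first pass
        e_left = {s: v / lam for s, v in new.items()}
    q = {
        s: {nxt: e_left[nxt] / (lam * e_left[s]) for nxt in successors[s]}
        for s in successors
    }
    return lam, q


channel_a = {"0": ["0", "1"], "1": ["0"]}
lam, q = optimal_transitions(channel_a)
print(lam, q)
```

For channel A the resulting probability of sending a 1 after a 0 is about 0.38, which agrees with the optimal density f found for the sparse source driving code C2.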


> Exercise 17.10.[p.258] Show that when sequences are generated using the
optimal transition probability matrix (17.19), the entropy of the resulting
sequence is asymptotically $\log_2 \lambda$ per symbol. [Hint: consider the
conditional entropy of just one symbol given the previous one, assuming the
previous one’s distribution is the invariant distribution.]


In practice, we would probably use finite-precision approximations to the
optimal variable-length solution. One might dislike variable-length solutions
because of the resulting unpredictability of the actual encoded length in any
particular case. Perhaps in some applications we would like a guarantee that
the encoded length of a source file of size N bits will be less than a given
length such as $N/(C + \epsilon)$. For example, a disk drive is easier to control if
all blocks of 512 bytes are known to take exactly the same amount of disk
real-estate. For some constrained channels we can make a simple modification
to our variable-length encoding and offer such a guarantee, as follows. We
find two codes, two mappings of binary strings to variable-length encodings,
having the property that for any source string x, if the encoding of x under
the first code is shorter than average, then the encoding of x under the second
code is longer than average, and vice versa. Then to transmit a string x we
encode the whole string with both codes and send whichever encoding has the
shortest length, prepended by a suitably encoded single bit to convey which
of the two codes is being used.


> Exercise 17.11.[p.258] How many valid sequences of length 8 starting with
a 0 are there for the run-length-limited channels shown in figure 17.9?


Figure 17.9. State diagrams and connection matrices for channels with
maximum runlengths for 1s equal to 2 and 3.

Connection matrix, maximum runlength 2:
  0 1 0
  0 0 1
  1 1 1

Connection matrix, maximum runlength 3:
  0 1 0 0
  0 0 1 0
  0 0 0 1
  1 1 1 1


What are the capacities of these channels? 


Using a computer, find the matrices Q for generating a random path 
through the trellises of the channel A, and the two run-length-limited 
channels shown in figure 17.9. 
> Exercise 17.12.[p.258] Consider the run-length-limited channel in which any
length of run of 0s is permitted, and the maximum run length of 1s is a
large number L such as nine or ninety.


Estimate the capacity of this channel. (Give the first two terms in a
series expansion involving L.)


What, roughly, is the form of the optimal matrix Q for generating a
random path through the trellis of this channel? Focus on the values of
the elements $Q_{1|0}$, the probability of generating a 1 given a preceding 0,
and $Q_{L|L-1}$, the probability of generating a 1 given a preceding run of
L−1 1s. Check your answer by explicit computation for the channel in
which the maximum runlength of 1s is nine.


> 17.6 Variable symbol durations 


We can add a further frill to the task of communicating over constrained 
channels by assuming that the symbols we send have different durations, and 
that our aim is to communicate at the maximum possible rate per unit time. 
Such channels can come in two flavours: unconstrained, and constrained. 


Unconstrained channels with variable symbol durations 


We encountered an unconstrained noiseless channel with variable symbol du- 
rations in exercise 6.18 (p.125). Solve that problem, and you’ve done this 
topic. The task is to determine the optimal frequencies with which the sym- 
bols should be used, given their durations. 

There is a nice analogy between this task and the task of designing an
optimal symbol code (Chapter 4). When we make a binary symbol code
for a source with unequal probabilities $p_i$, the optimal message lengths are
$l_i^* = \log_2 1/p_i$, so

$$ p_i = 2^{-l_i^*}. \tag{17.20} $$

Similarly, when we have a channel whose symbols have durations $l_i$ (in some
units of time), the optimal probability with which those symbols should be
used is

$$ p_i^* = 2^{-\beta l_i}, \tag{17.21} $$

where $\beta$ is the capacity of the channel in bits per unit time.
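
Since the probabilities in (17.21) must sum to one, the capacity can be found numerically as the root of that normalization condition. The sketch below does this by bisection; the function name and the example durations are illustrative assumptions.

```python
def capacity_from_durations(durations, tolerance=1e-12):
    """Solve sum_i 2**(-beta * l_i) = 1 for beta by bisection.

    Returns (beta, probabilities), where probabilities[i] = 2**(-beta * l_i).
    Assumes at least two symbols, all with positive durations.
    """
    def normalizer(beta):
        return sum(2.0 ** (-beta * l) for l in durations)

    low, high = 0.0, 1.0
    while normalizer(high) > 1.0:      # widen the bracket until it contains the root
        high *= 2.0
    while high - low > tolerance:
        mid = 0.5 * (low + high)
        if normalizer(mid) > 1.0:      # the sum falls as beta rises
            low = mid
        else:
            high = mid
    beta = 0.5 * (low + high)
    return beta, [2.0 ** (-beta * l) for l in durations]


beta, probabilities = capacity_from_durations([1, 2])
print(beta, probabilities)
```

With durations 1 and 2 the result is about 0.694 bits per unit time, the same number as the capacity of channel A, whose valid strings can be built from the two blocks 0 and 10.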


Constrained channels with variable symbol durations 


Once you have grasped the preceding topics in this chapter, you should be 
able to figure out how to define and find the capacity of these, the trickiest 
constrained channels. 


Exercise 17.13. A classic example of a constrained channel with variable
symbol durations is the ‘Morse’ channel, whose symbols are


the dot d,
the dash D,
the short space (used between letters in Morse code) s, and
the long space (used between words) S;


the constraints are that spaces may only be followed by dots and dashes. 


Find the capacity of this channel in bits per unit time assuming (a) that 
all four symbols have equal durations; or (b) that the symbol durations 
are 2, 4, 3 and 6 time units respectively. 
Exercise 17.14. How well-designed is Morse code for English (with, say, the
probability distribution of figure 2.1)?


Exercise 17.15. How difficult is it to get DNA into a narrow tube?


To an information theorist, the entropy associated with a constrained 
channel reveals how much information can be conveyed over it. In sta- 
tistical physics, the same calculations are done for a different reason: to 
predict the thermodynamics of polymers, for example. 


As a toy example, consider a polymer of length N that can either sit
in a constraining tube, of width L, or in the open where there are no
constraints. In the open, the polymer adopts a state drawn at random
from the set of one dimensional random walks, with, say, 3 possible
directions per step. The entropy of this walk is log 3 per step, i.e., a
total of N log 3. [The free energy of the polymer is defined to be −kT
times this, where T is the temperature.] In the tube, the polymer’s
one-dimensional walk can go in 3 directions unless the wall is in the way, so
the connection matrix is, for example (if L = 10),

  1 1 0 0 0 0 0 0 0 0
  1 1 1 0 0 0 0 0 0 0
  0 1 1 1 0 0 0 0 0 0
  0 0 1 1 1 0 0 0 0 0
  0 0 0 1 1 1 0 0 0 0
          ...
  0 0 0 0 0 0 0 1 1 1
  0 0 0 0 0 0 0 0 1 1


Now, what is the entropy of the polymer? What is the change in entropy 
associated with the polymer entering the tube? If possible, obtain an 
expression as a function of L. Use a computer to find the entropy of the 
walk for a particular value of L, e.g. 20, and plot the probability density 
of the polymer’s transverse location in the tube. 


Notice the difference in capacity between two channels, one constrained
and one unconstrained, is directly proportional to the force required to
pull the DNA into the tube.

Figure 17.10. Model of DNA squashed in a narrow tube. The DNA will have a
tendency to pop out of the tube, because, outside the tube, its random walk
has greater entropy.


> 17.7 Solutions 


Solution to exercise 17.5 (p.250). A file transmitted by C2 contains, on
average, one-third 1s and two-thirds 0s.
If f = 0.38, the fraction of 1s is $f/(1 + f) = (\gamma - 1.0)/(2\gamma - 1.0) = 0.2764$.


Solution to exercise 17.7 (p.254). A valid string for channel C can be obtained
from a valid string for channel A by first inverting it [1 → 0; 0 → 1], then
passing it through an accumulator. These operations are invertible, so any
valid string for C can also be mapped onto a valid string for A. The only 
proviso here comes from the edge effects. If we assume that the first character 
transmitted over channel C is preceded by a string of zeroes, so that the first 
character is forced to be a 1 (figure 17.5c) then the two channels are exactly 
equivalent only if we assume that channel A’s first character must be a zero. 


Solution to exercise 17.8 (p.254). With N = 16 transmitted bits, the largest 
integer number of source bits that can be encoded is 10, so the maximum rate 
of a fixed length code with N = 16 is 0.625. 
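Solution to exercise 17.10 (p.255). Let the invariant distribution be

$$ P(s) = \alpha \, e^{(L)}_s e^{(R)}_s, \tag{17.22} $$

where $\alpha$ is a normalization constant. The entropy of $S_t$ given $S_{t-1}$, assuming
$S_{t-1}$ comes from the invariant distribution, is

[Here, as in Chapter 4, $S_t$ denotes the ensemble whose random variable is the
state $s_t$.]

$$ H(S_t | S_{t-1}) = - \sum_{s, s'} P(s) P(s'|s) \log P(s'|s) \tag{17.23} $$

$$ = - \sum_{s, s'} \alpha \, e^{(L)}_s e^{(R)}_s \, \frac{e^{(L)}_{s'} A_{s's}}{\lambda e^{(L)}_s} \log \frac{e^{(L)}_{s'} A_{s's}}{\lambda e^{(L)}_s} \tag{17.24} $$

$$ = - \sum_{s, s'} \alpha \, e^{(R)}_s \, \frac{e^{(L)}_{s'} A_{s's}}{\lambda} \left[ \log e^{(L)}_{s'} + \log A_{s's} - \log \lambda - \log e^{(L)}_s \right]. \tag{17.25} $$

Now, $A_{s's}$ is either 0 or 1, so the contributions from the terms proportional to
$A_{s's} \log A_{s's}$ are all zero. So

$$ H(S_t | S_{t-1}) = \log \lambda - \frac{\alpha}{\lambda} \sum_{s'} \left( \sum_s A_{s's} e^{(R)}_s \right) e^{(L)}_{s'} \log e^{(L)}_{s'} + \frac{\alpha}{\lambda} \sum_s \left( \sum_{s'} e^{(L)}_{s'} A_{s's} \right) e^{(R)}_s \log e^{(L)}_s \tag{17.26} $$

$$ = \log \lambda - \frac{\alpha}{\lambda} \sum_{s'} \lambda \, e^{(R)}_{s'} e^{(L)}_{s'} \log e^{(L)}_{s'} + \frac{\alpha}{\lambda} \sum_s \lambda \, e^{(L)}_s e^{(R)}_s \log e^{(L)}_s \tag{17.27} $$

$$ = \log \lambda. \tag{17.28} $$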


Solution to exercise 17.11 (p.255). The principal eigenvalues of the connection
matrices of the two channels are 1.839 and 1.928. The capacities ($\log \lambda$) are
0.879 and 0.947 bits.


Solution to exercise 17.12 (p.256). The channel is similar to the unconstrained 
binary channel; runs of length greater than L are rare if L is large, so we only 
expect weak differences from this channel; these differences will show up in 
contexts where the run length is close to L. The capacity of the channel is 
very close to one bit. 

A lower bound on the capacity is obtained by considering the simple
variable-length code for this channel which replaces occurrences of the
maximum runlength string 111...1 by 111...10, and otherwise leaves the source file
unchanged. The average rate of this code is $1/(1 + 2^{-L})$ because the invariant
distribution will hit the ‘add an extra zero’ state a fraction $2^{-L}$ of the time.

We can reuse the solution for the variable-length channel in exercise 6.18
(p.125). The capacity is the value of $\beta$ such that the equation

$$ Z(\beta) = \sum_{l=1}^{L+1} 2^{-\beta l} = 1 \tag{17.29} $$

is satisfied. The L+1 terms in the sum correspond to the L+1 possible strings
that can be emitted, 0, 10, 110, ... , 11...10. The sum is exactly given by:

$$ Z(\beta) = 2^{-\beta} \, \frac{\left( 2^{-\beta} \right)^{L+1} - 1}{2^{-\beta} - 1}. \tag{17.30} $$

Here we used $\sum_{n=0}^{N} a r^n = \frac{a (r^{N+1} - 1)}{r - 1}$.

We anticipate that $\beta$ should be a little less than 1 in order for $Z(\beta)$ to
equal 1. Rearranging and solving approximately for $\beta$, using $\ln(1 + x) \simeq x$,

$$ Z(\beta) = 1 \tag{17.31} $$

$$ \Rightarrow \beta \simeq 1 - 2^{-(L+2)} / \ln 2. \tag{17.32} $$

We evaluated the true capacities for L = 2 and L = 3 in an earlier exercise.
The table compares the approximate capacity $\beta$ with the true capacity for a
selection of values of L.

  L    β         True capacity
  2    0.910     0.879
  3    0.955     0.947
  4    0.977     0.975
  5    0.9887    0.9881
  6    0.9944    0.9942
  9    0.9993    0.9993
The element $Q_{1|0}$ will be close to 1/2 (just a tiny bit smaller, as the
matrix below confirms), since in the unconstrained binary channel $Q_{1|0} = 1/2$.
When a run of length L − 1 has occurred, we effectively have a choice of
printing 10 or 0. Let the probability of selecting 10 be f. Let us estimate the
entropy of the remaining N characters in the stream as a function of f,
assuming the rest of the matrix Q to have been set to its optimal value. The
entropy of the next N characters in the stream is the entropy of the first bit,
$H_2(f)$, plus the entropy of the remaining characters, which is roughly (N − 1)
bits if we select 0 as the first bit and (N − 2) bits if 1 is selected. More
precisely, if C is the capacity of the channel (which is roughly 1),

$$ \begin{aligned} H(\text{the next } N \text{ chars}) &\simeq H_2(f) + \left[ (N-1)(1-f) + (N-2) f \right] C \\ &= H_2(f) + NC - (1+f) C \simeq H_2(f) + N - (1+f). \end{aligned} \tag{17.33} $$

Differentiating and setting to zero to find the optimal f, we obtain:

$$ \log_2 \frac{1-f}{f} \simeq 1 \;\Rightarrow\; \frac{1-f}{f} \simeq 2 \;\Rightarrow\; f \simeq 1/3. \tag{17.34} $$

The probability of emitting a 1 thus decreases from about 0.5 to about 1/3 as
the number of emitted 1s increases.

Here is the optimal matrix:

  0   .3334   0       0       0       0       0       0       0       0
  0   0       .4287   0       0       0       0       0       0       0
  0   0       0       .4669   0       0       0       0       0       0
  0   0       0       0       .4841   0       0       0       0       0
  0   0       0       0       0       .4923   0       0       0       0
  0   0       0       0       0       0       .4963   0       0       0
  0   0       0       0       0       0       0       .4983   0       0
  0   0       0       0       0       0       0       0       .4993   0
  0   0       0       0       0       0       0       0       0       .4998
  1   .6666   .5713   .5331   .5159   .5077   .5037   .5017   .5007   .5002


Our rough theory works. 
18 


Crosswords and Codebreaking 


In this chapter we make a random walk through a few topics related to lan- 
guage modelling. 


> 18.1 Crosswords 


The rules of crossword-making may be thought of as defining a constrained 
channel. The fact that many valid crosswords can be made demonstrates that 
this constrained channel has a capacity greater than zero. 

There are two archetypal crossword formats. In a ‘type A’ (or American)
crossword, every row and column consists of a succession of words of length 2
or more separated by one or more spaces. In a ‘type B’ (or British) crossword,
each row and column consists of a mixture of words and single characters,
separated by one or more spaces, and every character lies in at least one word
(horizontal or vertical). Whereas in a type A crossword every letter lies in a
horizontal word and a vertical word, in a typical type B crossword only about
half of the letters do so; the other half lie in one word only.

Type A crosswords are harder to create than type B because of the
constraint that no single characters are permitted. Type B crosswords are
generally harder to solve because there are fewer constraints per character.

Figure 18.1. Crosswords of types A (American) and B (British).


Why are crosswords possible?

If a language has no redundancy, then any letters written on a grid form a
valid crossword. In a language with high redundancy, on the other hand, it
is hard to make crosswords (except perhaps a small number of trivial ones).
The possibility of making crosswords in a language thus demonstrates a bound
on the redundancy of that language. Crosswords are not normally written in
genuine English. They are written in ‘word-English’, the language consisting
of strings of words from a dictionary, separated by spaces.


> Exercise 18.1. Estimate the capacity of word-English, in bits per character.
[Hint: think of word-English as defining a constrained channel (Chapter
17) and see exercise 6.18 (p.125).]


The fact that many crosswords can be made leads to a lower bound on the
entropy of word-English.

For simplicity, we now model word-English by Wenglish, the language
introduced in section 4.1 which consists of $W$ words all of length $L$. The entropy
of such a language, per character, including inter-word spaces, is:

\[ H_W \equiv \frac{\log_2 W}{L+1}. \tag{18.1} \]







We'll find that the conclusions we come to depend on the value of $H_W$ and
are not terribly sensitive to the value of $L$. Consider a large crossword of size
$S$ squares in area. Let the number of words be $f_w S$ and let the number of
letter-occupied squares be $f_1 S$. For typical crosswords of types A and B made
of words of length $L$, the two fractions $f_w$ and $f_1$ have roughly the values in
table 18.2.

We now estimate how many crosswords there are of size $S$ using our simple
model of Wenglish. We assume that Wenglish is created at random by generating
$W$ strings from a monogram (i.e., memoryless) source with entropy $H_0$.
If, for example, the source used all $A = 26$ characters with equal probability
then $H_0 = \log_2 A = 4.7$ bits. If instead we use Chapter 2’s distribution then
the entropy is 4.2 bits. The redundancy of Wenglish stems from two sources: it
tends to use some letters more than others; and there are only $W$ words in
the dictionary.

Let’s now count how many crosswords there are by imagining filling in
the squares of a crossword at random using the same distribution that produced
the Wenglish dictionary and evaluating the probability that this random
scribbling produces valid words in all rows and columns. The total number of
typical fillings-in of the $f_1 S$ squares in the crossword that can be made is

\[ |T| = 2^{f_1 S H_0}. \tag{18.2} \]

The probability that one word of length $L$ is validly filled-in is

\[ \beta = \frac{W}{2^{L H_0}}, \tag{18.3} \]

and the probability that the whole crossword, made of $f_w S$ words, is validly
filled-in by a single typical in-filling is approximately

\[ \beta^{f_w S}. \tag{18.4} \]


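So the log of the number of valid crosswords of size $S$ is estimated to be

\[ \log \beta^{f_w S} |T| = S \left[ (f_1 - f_w L) H_0 + f_w \log W \right] \tag{18.5} \]
\[ \phantom{\log \beta^{f_w S} |T|} = S \left[ (f_1 - f_w L) H_0 + f_w (L+1) H_W \right], \tag{18.6} \]

which is an increasing function of $S$ only if

\[ (f_1 - f_w L) H_0 + f_w (L+1) H_W > 0. \tag{18.7} \]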


So arbitrarily many crosswords can be made only if there are enough words in
the Wenglish dictionary that

\[ H_W > \frac{(f_w L - f_1)}{f_w (L+1)} H_0. \tag{18.8} \]


Plugging in the values of $f_1$ and $f_w$ from table 18.2, we find the following.


Crossword type               A                                          B

Condition for crosswords     $H_W > \frac{1}{2} \frac{L}{L+1} H_0$      $H_W > \frac{1}{4} \frac{L}{L+1} H_0$


If we set $H_0 = 4.2$ bits and assume there are $W = 4000$ words in a normal
English-speaker’s dictionary, all with length $L = 5$, then
$H_W = \log_2 4000 / 6 \simeq 2.0$ bits, and we find that the
condition for crosswords of type B (a threshold of about 0.9 bits) is satisfied,
but the condition for crosswords of type A (about 1.75 bits) is only just
satisfied. This fits with my experience that crosswords
of type A usually contain more obscure words.

             A                 B

$f_w$        $\frac{2}{L+1}$   $\frac{1}{L+1}$

$f_1$        $\frac{L}{L+1}$   $\frac{3}{4} \frac{L}{L+1}$

Table 18.2. Factors $f_w$ and $f_1$ by which the number of words and
number of letter-squares respectively are smaller than the
total number of squares.


This calculation underestimates the number of valid Wenglish
crosswords by counting only crosswords filled with ‘typical’
strings. If the monogram distribution is non-uniform then
the true count is dominated by ‘atypical’ fillings-in, in which
crossword-friendly words appear more often.
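
The comparison between $H_W$ and the thresholds of (18.8) is easy to check numerically. The following is a minimal Python sketch, not part of the original argument; the function names are my own, and the fractions $f_w$ and $f_1$ are the values as I read them from table 18.2 (type A: $2/(L+1)$ and $L/(L+1)$; type B: $1/(L+1)$ and $\frac{3}{4} L/(L+1)$).

```python
import math


def wenglish_entropy(W, L):
    """Entropy per character of Wenglish, equation (18.1)."""
    return math.log2(W) / (L + 1)


def fractions(kind, L):
    """Return (f_w, f_1) for crossword type 'A' or 'B' (assumed table 18.2 values)."""
    if kind == "A":
        return 2 / (L + 1), L / (L + 1)
    if kind == "B":
        return 1 / (L + 1), 0.75 * L / (L + 1)
    raise ValueError("kind must be 'A' or 'B'")


def crossword_threshold(f_w, f_1, L, H0):
    """Right-hand side of (18.8): H_W must exceed this for crosswords to proliferate."""
    return (f_w * L - f_1) / (f_w * (L + 1)) * H0


W, L, H0 = 4000, 5, 4.2  # the values used in the text
H_W = wenglish_entropy(W, L)
for kind in ("A", "B"):
    f_w, f_1 = fractions(kind, L)
    threshold = crossword_threshold(f_w, f_1, L, H0)
    print(f"type {kind}: H_W = {H_W:.2f} bits, threshold = {threshold:.2f} bits")
```

Changing `W` or `L` shows how the margin for type A crosswords shrinks as the dictionary gets smaller.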
Further reading 


These observations about crosswords were first made by Shannon (1948); I 
learned about them from Wolf and Siegel (1998). The topic is closely related 
to the capacity of two-dimensional constrained channels. An example of a 
two-dimensional constrained channel is a two-dimensional bar-code, as seen 
on parcels. 





Exercise 18.2.[3] A two-dimensional channel is defined by the constraint that,
of the eight neighbours of every interior pixel in an $N \times N$ rectangular
grid, four must be black and four white. (The counts of black and white
pixels around boundary pixels are not constrained.) A binary pattern
satisfying this constraint is shown in figure 18.3. What is the capacity
of this channel, in bits per pixel, for large $N$?

Figure 18.3. A binary pattern in which every pixel is adjacent to
four black and four white pixels.











> 18.2 Simple language models 
The Zipf-Mandelbrot distribution 


The crudest model for a language is the monogram model, which asserts that 
each successive word is drawn independently from a distribution over words. 
What is the nature of this distribution over words? 

Zipf’s law (Zipf, 1949) asserts that the probability of the $r$th most probable
word in a language is approximately

\[ P(r) = \frac{\kappa}{r^{\alpha}}, \tag{18.9} \]

where the exponent $\alpha$ has a value close to 1, and $\kappa$ is a constant. According
to Zipf, a log-log plot of frequency versus word-rank should show a straight
line with slope $-\alpha$.

Mandelbrot’s (1982) modification of Zipf’s law introduces a third parameter
$v$, asserting that the probabilities are given by

\[ P(r) = \frac{\kappa}{(r + v)^{\alpha}}. \tag{18.10} \]

For some documents, such as Jane Austen’s Emma, the Zipf-Mandelbrot
distribution fits well (figure 18.4).
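
To see what (18.9) and (18.10) assert, it can help to evaluate them. Here is a minimal Python sketch under assumptions of my own: the vocabulary is truncated at a finite number of ranks so that $\kappa$ can be fixed by normalization, and `rank_frequency` is a hypothetical helper that turns any list of words into the empirical frequencies one would put on a Zipf plot.

```python
from collections import Counter


def zipf_mandelbrot(num_ranks, alpha=1.0, v=0.0):
    """P(r) = kappa / (r + v)**alpha for r = 1..num_ranks.

    kappa is chosen so that the probabilities sum to one over the
    truncated vocabulary (an assumption; v = 0 gives plain Zipf).
    """
    weights = [1.0 / (r + v) ** alpha for r in range(1, num_ranks + 1)]
    kappa = 1.0 / sum(weights)
    return [kappa * w for w in weights]


def rank_frequency(words):
    """Empirical frequencies of the words, sorted from most to least frequent."""
    counts = sorted(Counter(words).values(), reverse=True)
    total = sum(counts)
    return [c / total for c in counts]


model = zipf_mandelbrot(num_ranks=10000, alpha=1.0, v=2.7)
sample = rank_frequency("the cat and the dog and the bird".split())
print(model[:3])
print(sample)
```

Plotting the logarithm of either list against the logarithm of the rank gives the Zipf plot; with `v=0` the model is exactly a straight line of slope $-\alpha$.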

Other documents give distributions that are not so well fitted by a
Zipf-Mandelbrot distribution. Figure 18.5 shows a plot of frequency versus rank for
the LaTeX source of this book. Qualitatively, the graph is similar to a straight
line, but a curve is noticeable. To be fair, this source file is not written in
pure English; it is a mix of English, maths symbols such as ‘az’, and LaTeX
commands.


The Dirichlet process 


Assuming we are interested in monogram models for languages, what model 
should we use? One difficulty in modelling a language is the unboundedness 
of vocabulary. The greater the sample of language, the greater the number 
of words encountered. A generative model for a language should emulate 
this property. If asked ‘what is the next word in a newly-discovered work 
of Shakespeare?’, our probability distribution over words must surely include 
some non-zero probability for words that Shakespeare never used before. Our 
generative monogram model for language should also satisfy a consistency 
rule called exchangeability. If we imagine generating a new language from 
our generative model, producing an ever-growing corpus of text, all statistical 
properties of the text should be homogeneous: the probability of finding a 
particular word at a given location in the stream of text should be the same 
everywhere in the stream. 

The Dirichlet process model is a model for a stream of symbols (which we
think of as ‘words’) that satisfies the exchangeability rule and that allows the
vocabulary of symbols to grow without limit. The model has one parameter
$\alpha$. As the stream of symbols is produced, we identify each new symbol by a
unique integer $w$. When we have seen a stream of length $F$ symbols, we define
the probability of the next symbol in terms of the counts $\{F_w\}$ of the symbols
seen so far thus: the probability that the next symbol is a new symbol, never
seen before, is

\[ \frac{\alpha}{F + \alpha}. \tag{18.11} \]

The probability that the next symbol is symbol $w$ is

\[ \frac{F_w}{F + \alpha}. \tag{18.12} \]


Figure 18.6 shows Zipf plots (i.e., plots of symbol frequency versus rank) for
million-symbol ‘documents’ generated by Dirichlet process priors with values
of $\alpha$ ranging from 1 to 1000.

It is evident that a Dirichlet process is not an adequate model for observed 
distributions that roughly obey Zipf’s law. 
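
The generative rule (18.11, 18.12) is short enough to simulate directly. The sketch below is one possible Python implementation, not the code used to make the figures; the function names, the seed argument and the document length are illustrative assumptions. It uses the fact that copying the symbol found at a uniformly chosen earlier position selects symbol $w$ with probability $F_w/F$, which, multiplied by the probability $F/(F+\alpha)$ of not creating a new symbol, gives (18.12).

```python
import random
from collections import Counter


def dirichlet_process_stream(length, alpha, seed=0):
    """Generate `length` symbols (integers) from a Dirichlet process with parameter alpha."""
    rng = random.Random(seed)
    stream = []
    num_symbols = 0
    for F in range(length):
        if rng.random() < alpha / (F + alpha):
            # New symbol, probability alpha / (F + alpha); always taken when F = 0.
            symbol = num_symbols
            num_symbols += 1
        else:
            # Existing symbol w with probability F_w / (F + alpha):
            # copy the symbol at a uniformly random earlier position.
            symbol = stream[rng.randrange(F)]
        stream.append(symbol)
    return stream


def rank_frequency(stream):
    """Symbol frequencies sorted from most to least frequent (the data of a Zipf plot)."""
    counts = sorted(Counter(stream).values(), reverse=True)
    return [c / len(stream) for c in counts]


for alpha in (1, 10, 100):
    freqs = rank_frequency(dirichlet_process_stream(20000, alpha))
    print(alpha, len(freqs), freqs[:3])
```

Plotting `freqs` against rank on log-log axes for several values of `alpha` gives plots of the kind described for figure 18.6, though with shorter documents than the million-symbol ones used there.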
With a small tweak, however, Dirichlet processes can produce rather nice
Zipf plots. Imagine generating a language composed of elementary symbols
using a Dirichlet process with a rather small value of the parameter $\alpha$, so that
the number of reasonably frequent symbols is about 27. If we then declare
one of those symbols (now called ‘characters’ rather than words) to be a space
character, then we can identify the strings between the space characters as
‘words’. If we generate a language in this way then the frequencies of words
often come out as very nice Zipf plots, as shown in figure 18.7. Which character
is selected as the space character determines the slope of the Zipf plot: a less
probable space character gives rise to a richer language with a shallower slope.


> 18.3 Units of information content 


The information content of an outcome, $x$, whose probability is $P(x)$, is defined
to be

\[ h(x) = \log \frac{1}{P(x)}. \tag{18.13} \]

The entropy of an ensemble is an average information content,

\[ H(X) = \sum_x P(x) \log \frac{1}{P(x)}. \tag{18.14} \]


When we compare hypotheses with each other in the light of data, it is often
convenient to compare the log of the probability of the data under the
alternative hypotheses,

\[ \text{‘log evidence for } \mathcal{H}_i\text{’} = \log P(D \mid \mathcal{H}_i), \tag{18.15} \]

or, in the case where just two hypotheses are being compared, we evaluate the
‘log odds’,

\[ \log \frac{P(D \mid \mathcal{H}_1)}{P(D \mid \mathcal{H}_2)}, \tag{18.16} \]

which has also been called the ‘weight of evidence in favour of $\mathcal{H}_1$’. The
log evidence for a hypothesis, $\log P(D \mid \mathcal{H}_i)$, is the negative of the information
content of the data $D$: if the data have large information content, given a
hypothesis, then they are surprising to that hypothesis; if some other hypothesis
is not so surprised by the data, then that hypothesis becomes more probable.
‘Information content’, ‘surprise value’, and log likelihood or log evidence are
the same thing.

All these quantities are logarithms of probabilities, or weighted sums of 
logarithms of probabilities, so they can all be measured in the same units. 
The units depend on the choice of the base of the logarithm. 

The names that have been given to these units are shown in table 18.8.

Unit            Expression that has those units

bit             $\log_2 p$
nat             $\log_e p$
ban             $\log_{10} p$
deciban (db)    $10 \log_{10} p$

Table 18.8. Units of measurement of information content.


The bit is the unit that we use most in this book. Because the word ‘bit’
has other meanings, a backup name for this unit is the shannon. A byte is
8 bits. A megabyte is $2^{20} \simeq 10^6$ bytes. If one works in natural logarithms,
information contents and weights of evidence are measured in nats. The most
interesting units are the ban and the deciban.
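
Since the units of table 18.8 differ only in the base of the logarithm, converting between them is a matter of one scale factor. A minimal Python sketch (the function name and the dictionary of scale factors are my own):

```python
import math

# Multipliers that convert a natural-log quantity into each unit of table 18.8.
UNITS = {
    "bit": 1 / math.log(2),
    "nat": 1.0,
    "ban": 1 / math.log(10),
    "deciban": 10 / math.log(10),
}


def information_content(p, unit="bit"):
    """Information content log(1/p) of an outcome of probability p, in the given unit."""
    return -math.log(p) * UNITS[unit]


for unit in UNITS:
    # A factor of ten in probability is one ban, i.e. ten decibans.
    print(unit, round(information_content(0.1, unit), 3))
```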


The history of the ban 


Let me tell you why a factor of ten in probability is called a ban. When Alan 
Turing and the other codebreakers at Bletchley Park were breaking each new 
day’s Enigma code, their task was a huge inference problem: to infer, given 
the day’s cyphertext, which three wheels were in the Enigma machines that 
day; what their starting positions were; what further letter substitutions were 
in use on the steckerboard; and, not least, what the original German messages 
were. These inferences were conducted using Bayesian methods (of course!), 
and the chosen units were decibans or half-decibans, the deciban being judged 
the smallest weight of evidence discernible to a human. The evidence in favour 
of particular hypotheses was tallied using sheets of paper that were specially 
printed in Banbury, a town about 30 miles from Bletchley. The inference task 
was known as Banburismus, and the units in which Banburismus was played 
were called bans, after that town. 


> 18.4 A taste of Banburismus 


The details of the code-breaking methods of Bletchley Park were kept secret
for a long time, but some aspects of Banburismus can be pieced together.
I hope the following description of a small part of Banburismus is not too
inaccurate.¹

How much information was needed? The number of possible settings of
the Enigma machine was about $8 \times 10^{12}$. To deduce the state of the machine,
‘it was therefore necessary to find about 129 decibans from somewhere’, as
Good puts it ($10 \log_{10} (8 \times 10^{12}) \simeq 129$). Banburismus was aimed not at
deducing the entire state of the machine, but only at figuring out which wheels
were in use; the logic-based bombes, fed with guesses of the plaintext (cribs),
were then used to crack what the settings of the wheels were.

The Enigma machine, once its wheels and plugs were put in place,
implemented a continually-changing permutation cypher that wandered
deterministically through a state space of $26^3$ permutations. Because an enormous
number of messages were sent each day, there was a good chance that whatever
state one machine was in when sending one character of a message, there
would be another machine in the same state while sending a particular
character in another message. Because the evolution of the machine’s state was
deterministic, the two machines would remain in the same state as each other
for the rest of the transmission. The resulting correlations between the
outputs of such pairs of machines provided a dribble of information-content from
which Turing and his co-workers extracted their daily 129 decibans.


¹ I’ve been most helped by descriptions given by Tony Sale
(http://www.codesandciphers.org.uk/lectures/) and by Jack Good (1979), who
worked with Turing at Bletchley.


How to detect that two messages came from machines with a common 
state sequence 


The hypotheses are the null hypothesis, $\mathcal{H}_0$, which states that the machines
are in different states, and that the two plain messages are unrelated; and the
‘match’ hypothesis, $\mathcal{H}_1$, which says that the machines are in the same state,
and that the two plain messages are unrelated. No attempt is being made
here to infer what the state of either machine is. The data provided are the
two cyphertexts $\mathbf{x}$ and $\mathbf{y}$; let’s assume they both have length $T$ and that the
alphabet size is $A$ (26 in Enigma). What is the probability of the data, given
the two hypotheses?
First, the null hypothesis. This hypothesis asserts that the two cyphertexts
are given by

\[ \mathbf{x} = x_1 x_2 x_3 \ldots = c_1(u_1) c_2(u_2) c_3(u_3) \ldots \tag{18.17} \]

and

\[ \mathbf{y} = y_1 y_2 y_3 \ldots = c'_1(v_1) c'_2(v_2) c'_3(v_3) \ldots , \tag{18.18} \]

where the codes $c_t$ and $c'_t$ are two unrelated time-varying permutations of the
alphabet, and $u_1 u_2 u_3 \ldots$ and $v_1 v_2 v_3 \ldots$ are the plaintext messages. An exact
computation of the probability of the data $(\mathbf{x}, \mathbf{y})$ would depend on a language
model of the plain text, and a model of the Enigma machine’s guts, but if we
assume that each Enigma machine is an ideal random time-varying permutation,
then the probability distribution of the two cyphertexts is uniform. All
cyphertexts are equally likely.


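\[ P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_0) = \left( \frac{1}{A} \right)^{2T} \quad \text{for all } \mathbf{x}, \mathbf{y} \text{ of length } T. \tag{18.19} \]

What about $\mathcal{H}_1$? This hypothesis asserts that a single time-varying
permutation $c_t$ underlies both

\[ \mathbf{x} = x_1 x_2 x_3 \ldots = c_1(u_1) c_2(u_2) c_3(u_3) \ldots \tag{18.20} \]

and

\[ \mathbf{y} = y_1 y_2 y_3 \ldots = c_1(v_1) c_2(v_2) c_3(v_3) \ldots . \tag{18.21} \]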


What is the probability of the data $(\mathbf{x}, \mathbf{y})$? We have to make some assumptions
about the plaintext language. If it were the case that the plaintext language
was completely random, then the probability of $u_1 u_2 u_3 \ldots$ and $v_1 v_2 v_3 \ldots$ would
be uniform, and so would that of $\mathbf{x}$ and $\mathbf{y}$, so the probability $P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_1)$
would be equal to $P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_0)$, and the two hypotheses $\mathcal{H}_0$ and $\mathcal{H}_1$ would be
indistinguishable.

We make progress by assuming that the plaintext is not completely random.
Both plaintexts are written in a language, and that language has redundancies.
Assume for example that particular plaintext letters are used more
often than others. So, even though the two plaintext messages are unrelated,
they are slightly more likely to use the same letters as each other; if $\mathcal{H}_1$ is true,
two synchronized letters from the two cyphertexts are slightly more likely to
be identical. Similarly, if a language uses particular bigrams and trigrams
frequently, then the two plaintext messages will occasionally contain the same
bigrams and trigrams at the same time as each other, giving rise, if $\mathcal{H}_1$ is true,
to a little burst of 2 or 3 identical letters. Table 18.9 shows such a coincidence
in two plaintext messages that are unrelated, except that they are both
written in English.

    u         LITTLE-JACK-HORNER-SAT-IN-THE-CORNER-EATING-A-CHRISTMAS-PIE--HE-PUT-IN-H
    v         RIDE-A-COCK-HORSE-TO-BANBURY-CROSS-TO-SEE-A-FINE-LADY-UPON-A-WHITE-HORSE
    matches:  .*....*..******.*..............*...........*................*...........

Table 18.9. Two aligned pieces of English plaintext, u and v, with
matches marked by *. Notice that there are twelve matches, including a run
of six, whereas the expected number of matches in two completely random
strings of length $T = 74$ would be about 3. The two corresponding
cyphertexts from two machines in identical states would also have
twelve matches.

The codebreakers hunted among pairs of messages for pairs that were
suspiciously similar to each other, counting up the numbers of matching
monograms, bigrams, trigrams, etc. This method was first used by the Polish
codebreaker Rejewski.

Let’s look at the simple case of a monogram language model and estimate
how long a message is needed to be able to decide whether two machines
are in the same state. I’ll assume the source language is monogram-English,
the language in which successive letters are drawn i.i.d. from the probability
distribution $\{p_i\}$ of figure 2.1. The probability of $\mathbf{x}$ and $\mathbf{y}$ is nonuniform:
consider two single characters, $x_t = c_t(u_t)$ and $y_t = c_t(v_t)$; the probability
that they are identical is

\[ \sum_{u_t, v_t} P(u_t) P(v_t) \, \mathbb{1}[u_t = v_t] = \sum_i p_i^2 \equiv m. \tag{18.22} \]


We give this quantity the name $m$, for ‘match probability’; for both English
and German, $m$ is about 2/26 rather than 1/26 (the value that would hold
for a completely random language). Assuming that $c_t$ is an ideal random
permutation, the probability of $x_t$ and $y_t$ is, by symmetry,

\[ P(x_t, y_t \mid \mathcal{H}_1) = \begin{cases} \dfrac{m}{A} & \text{if } x_t = y_t \\[1ex] \dfrac{(1-m)}{A(A-1)} & \text{for } x_t \neq y_t. \end{cases} \tag{18.23} \]


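Given a pair of cyphertexts $\mathbf{x}$ and $\mathbf{y}$ of length $T$ that match in $M$ places and
do not match in $N$ places, the log evidence in favour of $\mathcal{H}_1$ is then

\[ \log \frac{P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_1)}{P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_0)} = M \log \frac{m/A}{1/A^2} + N \log \frac{\frac{(1-m)}{A(A-1)}}{1/A^2} \tag{18.24} \]
\[ \phantom{\log \frac{P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_1)}{P(\mathbf{x}, \mathbf{y} \mid \mathcal{H}_0)}} = M \log mA + N \log \frac{(1-m)A}{A-1}. \tag{18.25} \]

Every match contributes $\log mA$ in favour of $\mathcal{H}_1$; every non-match contributes
$\log \frac{A-1}{(1-m)A}$ in favour of $\mathcal{H}_0$.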


Match probability for monogram-English             $m$                                    0.076
Coincidental match probability                     $1/A$                                  0.037
log-evidence for $\mathcal{H}_1$ per match         $10 \log_{10} mA$                      3.1 db
log-evidence for $\mathcal{H}_1$ per non-match     $10 \log_{10} \frac{(1-m)A}{(A-1)}$    −0.18 db


If there were $M = 4$ matches and $N = 47$ non-matches in a pair of length
$T = 51$, for example, the weight of evidence in favour of $\mathcal{H}_1$ would be +4
decibans, or a likelihood ratio of 2.5 to 1 in favour.
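
The scoring rule (18.25) can be tried out on the two nursery rhymes of table 18.9. The sketch below is illustrative only: the function names are mine, and I assume $m = 0.076$ and $A = 27$, the latter inferred from the tabulated value $1/A = 0.037$ rather than stated in the text.

```python
import math

U = "LITTLE-JACK-HORNER-SAT-IN-THE-CORNER-EATING-A-CHRISTMAS-PIE--HE-PUT-IN-H"
V = "RIDE-A-COCK-HORSE-TO-BANBURY-CROSS-TO-SEE-A-FINE-LADY-UPON-A-WHITE-HORSE"


def count_matches(x, y):
    """Number of positions at which two aligned strings agree."""
    return sum(a == b for a, b in zip(x, y))


def log_evidence_db(M, N, m=0.076, A=27):
    """Weight of evidence for the match hypothesis, equation (18.25), in decibans.

    m and A are assumed values (see the lead-in above).
    """
    per_match = 10 * math.log10(m * A)
    per_non_match = 10 * math.log10((1 - m) * A / (A - 1))
    return M * per_match + N * per_non_match


M = count_matches(U, V)
N = len(U) - M
print("matches:", M, "non-matches:", N)
print("table 18.9 pair:", round(log_evidence_db(M, N), 1), "db")
print("M = 4, N = 47: ", round(log_evidence_db(4, 47), 1), "db")
```

Because cyphertexts from machines in a common state match wherever the plaintexts match, the same count would be obtained from the corresponding cyphertexts.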

The expected weight of evidence from a line of text of length $T = 20$
characters is the expectation of (18.25), which depends on whether $\mathcal{H}_1$ or $\mathcal{H}_0$
is true. If $\mathcal{H}_1$ is true then matches are expected to turn up at rate $m$, and the
expected weight of evidence is 1.4 decibans per 20 characters. If $\mathcal{H}_0$ is true
then spurious matches are expected to turn up at rate $1/A$, and the expected
weight of evidence is −1.1 decibans per 20 characters. Typically, roughly 400
characters need to be inspected in order to have a weight of evidence greater
than a hundred to one (20 decibans) in favour of one hypothesis or the other.
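
These expectations follow from (18.25) by replacing $M$ and $N$ with their expected values. A minimal Python sketch, again assuming $m = 0.076$ and $A = 27$ (inferred from the tabulated $1/A = 0.037$); the function name is hypothetical:

```python
import math


def expected_db_per_char(match_rate, m=0.076, A=27):
    """Expected weight of evidence per character (decibans) when matches occur at match_rate."""
    per_match = 10 * math.log10(m * A)
    per_non_match = 10 * math.log10((1 - m) * A / (A - 1))
    return match_rate * per_match + (1 - match_rate) * per_non_match


T = 20
under_H1 = T * expected_db_per_char(0.076)   # matches turn up at rate m
under_H0 = T * expected_db_per_char(1 / 27)  # matches turn up at rate 1/A
print(round(under_H1, 2), round(under_H0, 2))

# Characters needed, on average, to accumulate 20 decibans either way.
print(round(20 / expected_db_per_char(0.076)),
      round(20 / abs(expected_db_per_char(1 / 27))))
```

The averages come out at a few hundred characters; the actual evidence fluctuates around them, so a somewhat longer text is typically needed.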
So, two English plaintexts have more matches than two random strings. 
Furthermore, because consecutive characters in English are not independent, 
the bigram and trigram statistics of English are nonuniform and the matches 
tend to occur in bursts of consecutive matches. [The same observations also 
apply to German.] Using better language models, the evidence contributed 
by runs of matches was more accurately computed. Such a scoring system 
was worked out by Turing and refined by Good. Positive results were passed 
on to automated and human-powered codebreakers. According to Good, the 
longest false-positive that arose in this work was a string of 8 consecutive 
matches between two machines that were actually in unrelated states. 


Further reading 


For further reading about Turing and Bletchley Park, see Hodges (1983) and 
Good (1979). For an in-depth read about cryptography, Schneier’s (1996) 
book is highly recommended. It is readable, clear, and entertaining. 


> 18.5 Exercises 


> Exercise 18.3.[2] Another weakness in the design of the Enigma machine,
which was intended to emulate a perfectly random time-varying
permutation, is that it never mapped a letter to itself. When you press Q, what
comes out is always a different letter from Q. How much information per
character is leaked by this design flaw? How long a crib would be needed
to be confident that the crib is correctly aligned with the cyphertext?
And how long a crib would be needed to be able confidently to identify
the correct key?


[A crib is a guess for what the plaintext was. Imagine that the Brits 
know that a very important German is travelling from Berlin to Aachen, 
and they intercept Enigma-encoded messages sent to Aachen. It is a 
good bet that one or more of the original plaintext messages contains 
the string OBERSTURMBANNFUEHRERXGRAFXHEINRICHXVONXWEIZSAECKER, 
the name of the important chap. A crib could be used in a brute-force 
approach to find the correct Enigma key (feed the received messages 
through all possible Enigma machines and see if any of the putative 
decoded texts match the above plaintext). This question centres on the 
idea that the crib can also be used in a much less expensive manner: 
slide the plaintext crib along all the encoded messages until a perfect 
mismatch of the crib and the encoded message is found; if correct, this 
alignment then tells you a lot about the key.] 
Why have Sex? Information Acquisition 
and Evolution 


Evolution has been happening on earth for about the last $10^9$ years.
Undeniably, information has been acquired during this process. Thanks to the
tireless work of the Blind Watchmaker, some cells now carry within them all 
the information required to be outstanding spiders; other cells carry all the 
information required to make excellent octopuses. Where did this information 
come from? 

The entire blueprint of all organisms on the planet has emerged in a teach- 
ing process in which the teacher is natural selection: fitter individuals have 
more progeny, the fitness being defined by the local environment (including 
the other organisms). The teaching signal is only a few bits per individual: an 
individual simply has a smaller or larger number of grandchildren, depending 
on the individual’s fitness. ‘Fitness’ is a broad term that could cover 


• the ability of an antelope to run faster than other antelopes and hence
avoid being eaten by a lion;

• the ability of a lion to be well-enough camouflaged and run fast enough
to catch one antelope per day;

• the ability of a peacock to attract a peahen to mate with it;

• the ability of a peahen to rear many young simultaneously.


The fitness of an organism is largely determined by its DNA — both the coding 
regions, or genes, and the non-coding regions (which play an important role 
in regulating the transcription of genes). We’ll think of fitness as a function 
of the DNA sequence and the environment. 

How does the DNA determine fitness, and how does information get from 
natural selection into the genome? Well, if the gene that codes for one of an 
antelope’s proteins is defective, that antelope might get eaten by a lion early 
in life and have only two grandchildren rather than forty. The information 
content of natural selection is fully contained in a specification of which off- 
spring survived to have children — an information content of at most one bit 
per offspring. The teaching signal does not communicate to the ecosystem 
any description of the imperfections in the organism that caused it to have 
fewer children. The bits of the teaching signal are highly redundant, because, 
throughout a species, unfit individuals who are similar to each other will be 
failing to have offspring for similar reasons. 

So, how many bits per generation are acquired by the species as a whole 
by natural selection? How many bits has natural selection succeeded in con- 
veying to the human branch of the tree of life, since the divergence between 
Australopithecines and apes 4000000 years ago? Assuming a generation time
of 10 years for reproduction, there have been about 400000 generations of
human precursors since the divergence from apes. Assuming a population of
$10^9$ individuals, each receiving a couple of bits of information from natural
selection, the total number of bits of information responsible for modifying
the genomes of 4 million B.C. into today’s human genome is about $8 \times 10^{14}$
bits. However, as we noted, natural selection is not smart at collating the 
information that it dishes out to the population, and there is a great deal of 
redundancy in that information. If the population size were twice as great, 
would it evolve twice as fast? No, because natural selection will simply be 
correcting the same defects twice as often. 

John Maynard Smith has suggested that the rate of information acquisition 
by a species is independent of the population size, and is of order 1 bit per 
generation. This figure would allow for only 400000 bits of difference between 
apes and humans, a number that is much smaller than the total size of the 
human genome, $6 \times 10^9$ bits. [One human genome contains about $3 \times 10^9$
nucleotides.] It is certainly the case that the genomic overlap between apes 
and humans is huge, but is the difference that small? 

In this chapter, we’ll develop a crude model of the process of information 
acquisition through evolution, based on the assumption that a gene with two 
defects is typically likely to be more defective than a gene with one defect, and 
an organism with two defective genes is likely to be less fit than an organism 
with one defective gene. Undeniably, this is a crude model, since real biological 
systems are baroque constructions with complex interactions. Nevertheless, 
we persist with a simple model because it readily yields striking results. 

What we find from this simple model is that 


1. John Maynard Smith’s figure of 1 bit per generation is correct for an 
asexually-reproducing population; 


2. in contrast, if the species reproduces sexually, the rate of information
acquisition can be as large as $\sqrt{G}$ bits per generation, where $G$ is the
size of the genome.


We'll also find interesting results concerning the maximum mutation rate 
that a species can withstand. 


> 19.1 The model 


We study a simple model of a reproducing population of N individuals with 
a genome of size G bits: variation is produced by mutation or by recombina- 
tion (i.e., sex) and truncation selection selects the N fittest children at each 
generation to be the parents of the next. We find striking differences between 
populations that have recombination and populations that do not. 

The genotype of each individual is a vector $\mathbf{x}$ of $G$ bits, each having a good
state $x_g = 1$ and a bad state $x_g = 0$. The fitness $F(\mathbf{x})$ of an individual is simply
the sum of her bits:

\[ F(\mathbf{x}) = \sum_{g=1}^{G} x_g. \tag{19.1} \]


The bits in the genome could be considered to correspond either to genes
that have good alleles ($x_g = 1$) and bad alleles ($x_g = 0$), or to the nucleotides
of a genome. We will concentrate on the latter interpretation. The essential
property of fitness that we are assuming is that it is locally a roughly linear
function of the genome, that is, that there are many possible changes one
could make to the genome, each of which has a small effect on fitness, and
that these effects combine approximately linearly.

We define the normalized fitness $f(\mathbf{x}) \equiv F(\mathbf{x})/G$.

We consider evolution by natural selection under two models of variation. 


Variation by mutation. The model assumes discrete generations. At each 
generation, t, every individual produces two children. The children’s 
genotypes differ from the parent’s by random mutations. Natural selec- 
tion selects the fittest N progeny in the child population to reproduce, 
and a new generation starts. 


[The selection of the fittest N individuals at each generation is known 
as truncation selection.] 


The simplest model of mutations is that the child’s bits $\{x_g\}$ are
independent. Each bit has a small probability of being flipped, which,
thinking of the bits as corresponding roughly to nucleotides, is taken to
be a constant $m$, independent of $x_g$. [If alternatively we thought of the
bits as corresponding to genes, then we would model the probability of
the discovery of a good gene, $P(x_g = 0 \to x_g = 1)$, as being a smaller
number than the probability of a deleterious mutation in a good gene,
$P(x_g = 1 \to x_g = 0)$.]


Variation by recombination (or crossover, or sex). Our organisms are 
haploid, not diploid. They enjoy sex by recombination. The N individ- 
uals in the population are married into M = N/2 couples, at random, 
and each couple has C children — with C=4 children being our stan- 
dard assumption, so as to have the population double and halve every 
generation, as before. The C children’s genotypes are independent given 
the parents’. Each child obtains its genotype $\mathbf{z}$ by random crossover of
its parents’ genotypes, $\mathbf{x}$ and $\mathbf{y}$. The simplest model of recombination
has no linkage, so that:

\[ z_g = \begin{cases} x_g & \text{with probability } 1/2 \\ y_g & \text{with probability } 1/2. \end{cases} \tag{19.2} \]


Once the MC progeny have been born, the parents pass away, the fittest 
N progeny are selected by natural selection, and a new generation starts. 


We now study these two models of variation in detail. 
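
Before the theory, it may help to see the two models written out as a simulation. The following Python sketch is one way to implement the description above; the function names, the population size, the genome size, the mutation rate and the number of generations are all illustrative choices of mine, not values taken from the text.

```python
import random


def fitness(x):
    """F(x): the number of good bits in the genotype, equation (19.1)."""
    return sum(x)


def mutate(x, m, rng):
    """Copy genotype x, flipping each bit independently with probability m."""
    return [bit ^ 1 if rng.random() < m else bit for bit in x]


def crossover(x, y, rng):
    """Recombination with no linkage, equation (19.2)."""
    return [xg if rng.random() < 0.5 else yg for xg, yg in zip(x, y)]


def select(children, N):
    """Truncation selection: keep the N fittest children."""
    return sorted(children, key=fitness, reverse=True)[:N]


def step_mutation(pop, m, rng):
    """One generation of variation by mutation: two children per parent."""
    children = [mutate(x, m, rng) for x in pop for _ in range(2)]
    return select(children, len(pop))


def step_sex(pop, rng, C=4):
    """One generation of variation by recombination: random couples, C children each."""
    parents = pop[:]
    rng.shuffle(parents)
    children = []
    for x, y in zip(parents[0::2], parents[1::2]):
        children.extend(crossover(x, y, rng) for _ in range(C))
    return select(children, len(pop))


def mean_fitness(pop):
    return sum(fitness(x) for x in pop) / len(pop)


rng = random.Random(0)
G, N, generations = 100, 50, 20
start = [[rng.randint(0, 1) for _ in range(G)] for _ in range(N)]
asexual, sexual = start, start
for _ in range(generations):
    asexual = step_mutation(asexual, 1 / (4 * G), rng)
    sexual = step_sex(sexual, rng)
print("mutation only:", mean_fitness(asexual))
print("recombination:", mean_fitness(sexual))
```

The population starts from random genotypes, so the normalized fitness begins near 1/2; comparing the two printed means after a fixed number of generations gives a rough feel for the difference between the two modes of variation.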


> 19.2 Rate of increase of fitness 


Theory of mutations 


We assume that the genotype of an individual with normalized fitness $f = F/G$
is subjected to mutations that flip bits with probability $m$. We first show that
if the average normalized fitness $f$ of the population is greater than 1/2, then
the optimal mutation rate is small, and the rate of acquisition of information
is at most of order one bit per generation.

Since it is easy to achieve a normalized fitness of $f = 1/2$ by simple mutation,
we’ll assume $f > 1/2$ and work in terms of the excess normalized fitness
$\delta f \equiv f - 1/2$. If an individual with excess normalized fitness $\delta f$ has a child
and the mutation rate $m$ is small, the probability distribution of the excess
normalized fitness of the child has mean

\[ \overline{\delta f}_{\rm child} = (1 - 2m)\, \delta f \tag{19.3} \]
and variance

\[ \frac{m(1-m)}{G} \simeq \frac{m}{G}. \tag{19.4} \]

If the population of parents has mean $\delta f(t)$ and variance $\sigma^2(t) \equiv \beta m/G$, then
the child population, before selection, will have mean $(1 - 2m)\,\delta f(t)$ and
variance $(1+\beta) m/G$. Natural selection chooses the upper half of this distribution,
so the mean fitness and variance of fitness at the next generation are given by

\[ \delta f(t+1) = (1 - 2m)\, \delta f(t) + \alpha \sqrt{(1+\beta)} \sqrt{\frac{m}{G}}, \tag{19.5} \]

\[ \sigma^2(t+1) = \gamma (1+\beta) \frac{m}{G}, \tag{19.6} \]


where $\alpha$ is the mean deviation from the mean, measured in standard
deviations, and $\gamma$ is the factor by which the child distribution’s
variance is reduced by selection. The numbers $\alpha$ and $\gamma$ are of
order 1. For the case of a Gaussian distribution,
$\alpha = \sqrt{2/\pi} \simeq 0.8$ and $\gamma = (1 - 2/\pi) \simeq 0.36$. If
we assume that the variance is in dynamic equilibrium, i.e.,
$\sigma^2(t+1) \simeq \sigma^2(t)$, then

$$ \gamma(1+\beta) = \beta, \quad \text{so} \quad (1+\beta) = \frac{1}{1-\gamma}, \tag{19.7} $$

and the factor $\alpha\sqrt{(1+\beta)}$ in equation (19.5) is equal to 1, if we
take the results for the Gaussian distribution, an approximation that becomes
poorest when the discreteness of fitness becomes important, i.e., for small
$m$. The rate of increase of normalized fitness is thus:

$$ \frac{df}{dt} \simeq -2m\,\delta f + \sqrt{\frac{m}{G}}, \tag{19.8} $$


which, assuming $G(\delta f)^2 \gg 1$, is maximized for

$$ m_{\rm opt} = \frac{1}{16\,G\,(\delta f)^2}, \tag{19.9} $$

at which point,

$$ \left(\frac{df}{dt}\right)_{\rm opt} = \frac{1}{8G(\delta f)}. \tag{19.10} $$

So the rate of increase of fitness $F = fG$ is at most

$$ \frac{dF}{dt} = \frac{1}{8(\delta f)} \text{ per generation.} \tag{19.11} $$

For a population with low fitness ($\delta f < 0.125$), the rate of increase
of fitness may exceed 1 unit per generation. Indeed, if
$\delta f \lesssim 1/\sqrt{G}$, the rate of increase, if $m = 1/2$, is of
order $\sqrt{G}$; this initial spurt can last only of order $\sqrt{G}$
generations. For $\delta f > 0.125$, the rate of increase of fitness is
smaller than one per generation. As the fitness approaches $G$, the optimal
mutation rate tends to $m = 1/(4G)$, so that an average of 1/4 bits are
flipped per genotype, and the rate of increase of fitness is also equal to
1/4; information is gained at a rate of about 0.5 bits per generation. It
takes about $2G$ generations for the genotypes of all individuals in the
population to attain perfection.
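As a quick numerical check of equations (19.8)–(19.11), the following sketch evaluates the rate of increase at the claimed optimum and compares it with a crude grid search. The function names are hypothetical, and the values of G and δf are arbitrary illustrations.

```python
import math


def fitness_rate(m, delta_f, G):
    """Equation (19.8): df/dt for mutation rate m and excess fitness delta_f."""
    return -2.0 * m * delta_f + math.sqrt(m / G)


def m_opt(delta_f, G):
    """Equation (19.9): the mutation rate that maximizes (19.8)."""
    return 1.0 / (16.0 * G * delta_f ** 2)


def max_rate_F(delta_f):
    """Equation (19.11): the maximal dF/dt, in units of fitness per generation."""
    return 1.0 / (8.0 * delta_f)


if __name__ == "__main__":
    G = 1000
    for delta_f in (0.05, 0.125, 0.25, 0.5):
        m = m_opt(delta_f, G)
        # Grid search between 0.5*m_opt and 1.5*m_opt as an independent check.
        grid = [m * (0.5 + 0.001 * k) for k in range(1001)]
        best = max(grid, key=lambda mm: fitness_rate(mm, delta_f, G))
        print(delta_f, m * G, best * G,
              G * fitness_rate(m, delta_f, G), max_rate_F(delta_f))
```

With δf = 1/2 (fitness close to G) the optimum is mG = 1/4 and the rate of increase of F is 1/4, as stated in the text.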

For fixed $m$, the fitness is given by

$$ \delta f(t) = \frac{1}{2\sqrt{mG}}\left(1 - c\,e^{-2mt}\right), \tag{19.12} $$

subject to the constraint $\delta f(t) \leq 1/2$, where $c$ is a constant of
integration, equal to 1 if $f(0) = 1/2$. If the mean number of bits flipped
per genotype, $mG$, exceeds 1, then the fitness $F$ approaches an equilibrium
value $F_{\rm eqm} = (1/2 + 1/(2\sqrt{mG}))G$.
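Equation (19.12) is the solution of the linear differential equation (19.8) at fixed m. A minimal cross-check, assuming a forward-Euler integrator with a step size I have picked arbitrarily, is sketched below; the function names are hypothetical.

```python
import math


def delta_f_closed_form(t, m, G, c=1.0):
    """Equation (19.12), capped at the constraint delta_f <= 1/2."""
    value = (1.0 - c * math.exp(-2.0 * m * t)) / (2.0 * math.sqrt(m * G))
    return min(value, 0.5)


def delta_f_euler(t_end, m, G, delta_f0=0.0, dt=0.01):
    """Integrate equation (19.8) with forward Euler steps (numerical cross-check)."""
    delta_f = delta_f0
    for _ in range(int(round(t_end / dt))):
        delta_f += dt * (-2.0 * m * delta_f + math.sqrt(m / G))
        delta_f = min(delta_f, 0.5)
    return delta_f


def equilibrium_fitness(m, G):
    """F_eqm = (1/2 + 1/(2 sqrt(mG))) G, relevant when mG exceeds 1."""
    return (0.5 + 0.5 / math.sqrt(m * G)) * G


if __name__ == "__main__":
    G = 1000
    m = 6.0 / G                       # one of the mutation rates used in the simulations
    for t in (10, 100, 500):
        print(t, delta_f_closed_form(t, m, G), delta_f_euler(t, m, G))
    print("F_eqm =", equilibrium_fitness(m, G))
```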

This theory is somewhat inaccurate in that the true probability distribu- 
tion of fitness is non-Gaussian, asymmetrical, and quantized to integer values. 
All the same, the predictions of the theory are not grossly at variance with 
the results of simulations described below. 


Theory of sex 


The analysis of the sexual population becomes tractable with two approxi- 
mations: first, we assume that the gene-pool mixes sufficiently rapidly that 
correlations between genes can be neglected; second, we assume homogeneity, 
i.e., that the fraction $f_g$ of bits $g$ that are in the good state is the
same, $f(t)$, for all $g$.

Given these assumptions, if two parents of fitness $F = fG$ mate, the
probability distribution of their children’s fitness has mean equal to the
parents’ fitness, $F$; the variation produced by sex does not reduce the
average fitness. The standard deviation of the fitness of the children scales
as $\sqrt{Gf(1-f)}$. Since, after selection, the increase in fitness is
proportional to this standard deviation, the fitness increase per generation
scales as the square root of the size of the genome, $\sqrt{G}$. As shown in
box 19.2, the mean fitness $\bar{F} = fG$ evolves in accordance with the
differential equation:

$$ \frac{d\bar{F}}{dt} \simeq \eta\sqrt{f(t)(1-f(t))G}, \tag{19.13} $$

where $\eta \equiv \sqrt{2/(\pi+2)}$. The solution of this equation is

$$ f(t) = \frac{1}{2}\left[1 + \sin\left(\frac{\eta}{\sqrt{G}}(t+c)\right)\right], \quad \text{for } t + c \in \left(-\frac{\pi}{2}\sqrt{G}/\eta,\ \frac{\pi}{2}\sqrt{G}/\eta\right), \tag{19.14} $$

where $c$ is a constant of integration, $c = \sin^{-1}(2f(0) - 1)$. So this
idealized system reaches a state of eugenic perfection ($f = 1$) within a
finite time: $(\pi/\eta)\sqrt{G}$ generations.
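The closed-form solution (19.14) can be checked against a direct numerical integration of (19.13). The sketch below assumes the random initial state f(0) = 1/2, for which c = 0; the Euler step size and the function names are my own illustrative choices.

```python
import math

ETA = math.sqrt(2.0 / (math.pi + 2.0))   # the constant eta from the text, about 0.62


def f_theory(t, G, c=0.0):
    """Equation (19.14); the argument of the sine is clamped to [-pi/2, pi/2]."""
    theta = ETA * (t + c) / math.sqrt(G)
    theta = max(-math.pi / 2, min(math.pi / 2, theta))
    return 0.5 * (1.0 + math.sin(theta))


def f_euler(t_end, G, f0=0.5, dt=0.001):
    """Forward-Euler integration of (19.13), rewritten for f = F/G."""
    f = f0
    for _ in range(int(round(t_end / dt))):
        f += dt * ETA * math.sqrt(max(f * (1.0 - f), 0.0) / G)
        f = min(f, 1.0)
    return f


def generations_to_perfection(G, from_half=True):
    """(pi/2) sqrt(G)/eta starting from f = 1/2; (pi/eta) sqrt(G) over the full range."""
    full = math.pi * math.sqrt(G) / ETA
    return full / 2.0 if from_half else full


if __name__ == "__main__":
    G = 1000
    for t in (5, 20, 50):
        print(t, G * f_theory(t, G), G * f_euler(t, G))
    print("generations from f = 1/2 to f = 1:", generations_to_perfection(G))
```

For G = 1000 the idealized population starting at f = 1/2 reaches perfection in about 80 generations, far fewer than the roughly 2G generations needed under mutation alone.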


Simulations 


Figure 19.3a shows the fitness of a sexual population of N = 1000 individ- 
uals with a genome size of G = 1000 starting from a random initial state 
with normalized fitness 0.5. It also shows the theoretical curve f(t)G from 
equation (19.14), which fits remarkably well. 

In contrast, figures 19.3(b) and (c) show the evolving fitness when variation 
is produced by mutation at rates m = 0.25/G and m = 6/G respectively. Note 
the difference in the horizontal scales from panel (a). 





Figure 19.1. Why sex is better than sex-free reproduction. If mutations are
used to create variation among children, then it is unavoidable that the
average fitness of the children is lower than the parents’ fitness; the
greater the variation, the greater the average deficit. Selection bumps up the
mean fitness again. In contrast, recombination produces variation without a
decrease in average fitness. The typical amount of variation scales as
$\sqrt{G}$, where $G$ is the genome size, so after selection, the average
fitness rises by $O(\sqrt{G})$.




Box 19.2. Details of the theory of sex.

How does $f(t+1)$ depend on $f(t)$? Let’s first assume the two parents of a
child both have exactly $f(t)G$ good bits, and, by our homogeneity assumption,
that those bits are independent random subsets of the $G$ bits. The number of
bits that are good in both parents is roughly $f(t)^2G$, and the number that
are good in one parent only is roughly $2f(t)(1-f(t))G$, so the fitness of the
child will be $f(t)^2G$ plus the sum of $2f(t)(1-f(t))G$ fair coin flips,
which has a binomial distribution of mean $f(t)(1-f(t))G$ and variance
$\frac{1}{2}f(t)(1-f(t))G$. The fitness of a child is thus roughly distributed
as

$$ F_{\rm child} \sim \text{Normal}\left(\text{mean} = f(t)G,\ \text{variance} = \tfrac{1}{2}f(t)(1-f(t))G\right). $$

The important property of this distribution, contrasted with the distribution
under mutation, is that the mean fitness is equal to the parents’ fitness; the
variation produced by sex does not reduce the average fitness.

If we include the parental population’s variance, which we will write as
$\sigma^2(t) = \beta(t)\,\frac{1}{2}f(t)(1-f(t))G$, the children’s fitnesses
are distributed as

$$ F_{\rm child} \sim \text{Normal}\left(\text{mean} = f(t)G,\ \text{variance} = \left(1 + \tfrac{\beta}{2}\right)\tfrac{1}{2}f(t)(1-f(t))G\right). $$

Natural selection selects the children on the upper side of this distribution.
The mean increase in fitness will be

$$ \bar{F}(t+1) - \bar{F}(t) = \left[\alpha(1+\beta/2)^{1/2}/\sqrt{2}\right]\sqrt{f(t)(1-f(t))G}, $$

and the variance of the surviving children will be

$$ \sigma^2(t+1) = \gamma(1+\beta/2)\,\tfrac{1}{2}f(t)(1-f(t))G, $$

where $\alpha = \sqrt{2/\pi}$ and $\gamma = (1 - 2/\pi)$. If there is dynamic
equilibrium [$\sigma^2(t+1) = \sigma^2(t)$] then $\gamma(1+\beta/2) = \beta$,
and the bracketed factor in the mean increase is

$$ \alpha(1+\beta/2)^{1/2}/\sqrt{2} = \sqrt{\frac{2}{\pi+2}} \simeq 0.62. $$

Defining this constant to be $\eta \equiv \sqrt{2/(\pi+2)}$, we conclude that,
under sex and natural selection, the mean fitness of the population increases
at a rate proportional to the square root of the size of the genome,

$$ \frac{d\bar{F}}{dt} \simeq \eta\sqrt{f(t)(1-f(t))G} \text{ bits per generation.} $$
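The ingredients of box 19.2 are easy to test by Monte Carlo. The sketch below makes three checks under the box's own assumptions (parents with exactly fG good bits at independent random positions, Gaussian truncation selection): the mean and variance of a child's fitness, the constants α and γ for the upper half of a Gaussian, and the value of η that follows from dynamic equilibrium. Function names and sample sizes are hypothetical.

```python
import math
import random
import statistics


def child_fitness(G, n_good, rng):
    """Fitness of one child of two parents, each with n_good good bits placed at random."""
    x = set(rng.sample(range(G), n_good))
    y = set(rng.sample(range(G), n_good))
    both = len(x & y)          # good in both parents: always inherited
    one = len(x ^ y)           # good in exactly one parent: a fair coin flip each
    return both + sum(rng.random() < 0.5 for _ in range(one))


def upper_half_constants(n_samples, rng):
    """Estimate alpha (mean) and gamma (variance) of the upper half of N(0, 1) samples."""
    samples = sorted(rng.gauss(0.0, 1.0) for _ in range(n_samples))
    upper = samples[n_samples // 2:]
    return statistics.fmean(upper), statistics.pvariance(upper)


def eta_from(alpha, gamma):
    """Equilibrium gamma (1 + beta/2) = beta, then eta = alpha sqrt((1 + beta/2)/2)."""
    beta = gamma / (1.0 - gamma / 2.0)
    return alpha * math.sqrt((1.0 + beta / 2.0) / 2.0)


if __name__ == "__main__":
    rng = random.Random(0)
    G, n_good = 400, 280                       # f = 0.7
    kids = [child_fitness(G, n_good, rng) for _ in range(2000)]
    print("child mean, variance:", statistics.fmean(kids), statistics.pvariance(kids))
    print("theory:              ", n_good, 0.5 * 0.7 * 0.3 * G)
    print("alpha, gamma:", upper_half_constants(20000, rng))
    print("eta:", eta_from(math.sqrt(2 / math.pi), 1 - 2 / math.pi))
```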


Exercise 19.1.[3, p.280] Dependence on population size. How do the results for
a sexual population depend on the population size? We anticipate that 
there is a minimum population size above which the theory of sex is 
accurate. How is that minimum population size related to G? 


Exercise 19.2.[3] Dependence on crossover mechanism. In the simple model of
sex, each bit is taken at random from one of the two parents, that is, we 
allow crossovers to occur with probability 50% between any two adjacent 
nucleotides. How is the model affected (a) if the crossover probability is 
smaller? (b) if crossovers occur exclusively at hot-spots located every d 
bits along the genome? 


> 19.3 The maximal tolerable mutation rate 


What if we combine the two models of variation? What is the maximum
mutation rate that can be tolerated by a species that has sex?
The rate of increase of fitness is given by

$$ \frac{df}{dt} \simeq -2m\,\delta f + \eta\sqrt{2}\sqrt{\frac{m + f(1-f)/2}{G}}, \tag{19.15} $$

which is positive if the mutation rate satisfies

$$ m < \eta\sqrt{\frac{f(1-f)}{G}}. \tag{19.16} $$

Let us compare this rate with the result in the absence of sex, which, from
equation (19.8), is that the maximum tolerable mutation rate is

$$ m < \frac{1}{G}\,\frac{1}{(2\,\delta f)^2}. \tag{19.17} $$

The tolerable mutation rate with sex is of order $\sqrt{G}$ times greater than
that without sex!

A parthenogenetic (non-sexual) species could try to wriggle out of this
bound on its mutation rate by increasing its litter sizes. But if mutation
flips on average $mG$ bits, the probability that no bits are flipped in one
genome is roughly $e^{-mG}$, so a mother needs to have roughly $e^{mG}$
offspring in order to have a good chance of having one child with the same
fitness as her. The litter size of a non-sexual species thus has to be
exponential in $mG$ (if $mG$ is bigger than 1), if the species is to persist.

So the maximum tolerable mutation rate is pinned close to $1/G$, for a
non-sexual species, whereas it is a larger number of order $1/\sqrt{G}$, for a
species with recombination.

Turning these results around, we can predict the largest possible genome
size for a given fixed mutation rate, $m$. For a parthenogenetic species, the
largest genome size is of order $1/m$, and for a sexual species, $1/m^2$.
Taking the figure $m = 10^{-8}$ as the mutation rate per nucleotide per
generation (Eyre-Walker and Keightley, 1999), and allowing for a maximum brood
size of 20 000 (that is, $mG \simeq 10$), we predict that all species with
more than $G = 10^9$ coding nucleotides make at least occasional use of
recombination. If the brood size is 12, then this number falls to
$G = 2.5 \times 10^8$.
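The bounds (19.16) and (19.17), and the genome-size predictions that follow from the litter-size argument, can be evaluated directly. In this sketch the function names are hypothetical, and `largest_asexual_genome` simply solves e^{mG} = brood size for G, which is my reading of the argument above.

```python
import math

ETA = math.sqrt(2.0 / (math.pi + 2.0))


def max_m_sex(f, G):
    """Equation (19.16): tolerable mutation rate with recombination."""
    return ETA * math.sqrt(f * (1.0 - f) / G)


def max_m_no_sex(f, G):
    """Equation (19.17): tolerable mutation rate without recombination (needs f > 1/2)."""
    delta_f = f - 0.5
    return 1.0 / (G * (2.0 * delta_f) ** 2)


def largest_asexual_genome(m, brood_size):
    """Largest G such that a brood of e^{mG} children fits within brood_size."""
    return math.log(brood_size) / m


if __name__ == "__main__":
    f = 0.9
    for G in (10**3, 10**5, 10**8):
        ratio = max_m_sex(f, G) / max_m_no_sex(f, G)
        print(G, max_m_sex(f, G), max_m_no_sex(f, G), ratio / math.sqrt(G))
    print(largest_asexual_genome(1e-8, 20000))   # about 1e9
    print(largest_asexual_genome(1e-8, 12))      # about 2.5e8
```

The last column printed in the loop is independent of G, which is the statement that the advantage of sex scales as the square root of the genome size.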


> 19.4 Fitness increase and information acquisition 


For this simple model it is possible to relate increasing fitness to information 
acquisition. 

If the bits are set at random, the fitness is roughly F = G/2. If evolution 
leads to a population in which all individuals have the maximum fitness F = G, 
then G bits of information have been acquired by the species, namely for each 
bit $x_g$, the species has figured out which of the two states is the better.

We define the information acquired at an intermediate fitness to be the
amount of selection (measured in bits) required to select the perfect state
from the gene pool. Let a fraction $f_g$ of the population have $x_g = 1$.
Because $\log_2(1/f)$ is the information required to find a black ball in an
urn containing black and white balls in the ratio $f : 1-f$, we define the
information acquired to be

$$ I = \sum_g \log_2 \frac{f_g}{1/2} \text{ bits.} \tag{19.18} $$

If all the fractions $f_g$ are equal to $F/G$, then

$$ I = G\log_2\frac{2F}{G}, \tag{19.19} $$

which is well approximated by

$$ \tilde{I} \equiv 2(F - G/2). \tag{19.20} $$
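To see how good the linear approximation (19.20) is, one can tabulate it against (19.19). This is a small sketch with hypothetical function names; the choice G = 1000 is arbitrary.

```python
import math


def information_exact(F, G):
    """Equation (19.19): I = G log2(2F/G), assuming every f_g equals F/G."""
    return G * math.log2(2.0 * F / G)


def information_approx(F, G):
    """Equation (19.20): the linear approximation 2 (F - G/2)."""
    return 2.0 * (F - G / 2.0)


if __name__ == "__main__":
    G = 1000
    for F in (500, 600, 700, 800, 900, 1000):
        print(F, round(information_exact(F, G), 1), information_approx(F, G))
```

The two expressions agree exactly at F = G/2 and F = G; in between, the exact value is slightly larger because the logarithm is concave.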


The rate of information acquisition is thus roughly two times the rate of
increase of fitness in the population.


> 19.5 Discussion


These results quantify the well known argument for why species reproduce
by sex with recombination, namely that recombination allows useful mutations
to spread more rapidly through the species and allows deleterious mutations
to be more rapidly cleared from the population (Maynard Smith, 1978;
Felsenstein, 1985; Maynard Smith, 1988; Maynard Smith and Szathmáry, 1995).
A population that reproduces by recombination can acquire information from
natural selection at a rate of order $\sqrt{G}$ times faster than a
parthenogenetic population, and it can tolerate a mutation rate that is of
order $\sqrt{G}$ times greater. For genomes of size $G \simeq 10^8$ coding
nucleotides, this factor of $\sqrt{G}$ is substantial.

This enormous advantage conferred by sex has been noted before by Kon- 
drashov (1988), but this meme, which Kondrashov calls ‘the deterministic 
mutation hypothesis’, does not seem to have diffused throughout the evolu- 
tionary research community, as there are still numerous papers in which the 
prevalence of sex is viewed as a mystery to be explained by elaborate mecha- 
nisms. 


‘The cost of males’ — stability of a gene for sex or parthenogenesis 


Why do people declare sex to be a mystery? The main motivation for being 
mystified is an idea called the ‘cost of males’. Sexual reproduction is disad- 
vantageous compared with asexual reproduction, it’s argued, because of every 
two offspring produced by sex, one (on average) is a useless male, incapable 
of child-bearing, and only one is a productive female. In the same time, a 
parthenogenetic mother could give birth to two female clones. To put it an- 
other way, the big advantage of parthenogenesis, from the point of view of 
the individual, is that one is able to pass on 100% of one’s genome to one’s 
children, instead of only 50%. Thus if there were two versions of a species, one 
reproducing with and one without sex, the single mothers would be expected 
to outstrip their sexual cousins. The simple model presented thus far did not 
include either genders or the ability to convert from sexual reproduction to 
asexual, but we can easily modify the model. 

We modify the model so that one of the $G$ bits in the genome determines
whether an individual prefers to reproduce parthenogenetically ($x = 1$) or
sexually ($x = 0$). The results depend on the number of children had by a
single parthenogenetic mother, $K_p$, and the number of children born by a
sexual couple, $K_s$. Both ($K_p = 2$, $K_s = 4$) and ($K_p = 4$, $K_s = 4$)
are reasonable models. The former ($K_p = 2$, $K_s = 4$) would seem most
appropriate in the case of unicellular organisms, where the cytoplasm of both
parents goes into the children. The latter ($K_p = 4$, $K_s = 4$) is
appropriate if the children are solely nurtured by one of the parents, so
single mothers have just as many offspring as a sexual pair. I concentrate on
the latter model, since it gives the greatest advantage to the parthenogens,
who are supposedly expected to outbreed the sexual community. Because
parthenogens have four children per generation, the maximum tolerable mutation
rate for them is twice the expression (19.17) derived before for $K_p = 2$.
If the fitness is large, the maximum tolerable rate is $mG \simeq 2$.

Initially the genomes are set randomly with F = G/2, with half of the pop- 
ulation having the gene for parthenogenesis. Figure 19.5 shows the outcome. 
During the ‘learning’ phase of evolution, in which the fitness is increasing 
rapidly, pockets of parthenogens appear briefly, but then disappear within 
a couple of generations as their sexual cousins overtake them in fitness and
leave them behind. Once the population reaches its top fitness, however, the
parthenogens can take over, if the mutation rate is sufficiently low (mG = 1).

Figure 19.5. Results when there is a gene for parthenogenesis, and no ... (a) mG = 4; (b) mG = 1.

In the presence of a higher mutation rate ($mG = 4$), however, the
parthenogens never take over. The breadth of the sexual population’s fitness
is of order $\sqrt{G}$, so a mutant parthenogenetic colony arising with
slightly above-average fitness will last for about
$\sqrt{G}/(mG) = 1/(m\sqrt{G})$ generations before its fitness falls below
that of its sexual cousins. As long as the population size is sufficiently
large for some sexual individuals to survive for this time, sex will not die
out.

In a sufficiently unstable environment, where the fitness function is con- 
tinually changing, the parthenogens will always lag behind the sexual commu- 
nity. These results are consistent with the argument of Haldane and Hamilton 
(2002) that sex is helpful in an arms race with parasites. The parasites define 
an effective fitness function which changes with time, and a sexual population 
will always ascend the current fitness function more rapidly. 


Additive fitness function 


Of course, our results depend on the fitness function that we assume, and on 
our model of selection. Is it reasonable to model fitness, to first order, as a sum 
of independent terms? Maynard Smith (1968) argues that it is: the more good 
genes you have, the higher you come in the pecking order, for example. The 
directional selection model has been used extensively in theoretical popula- 
tion genetic studies (Bulmer, 1985). We might expect real fitness functions to 
involve interactions, in which case crossover might reduce the average fitness. 
However, since recombination gives the biggest advantage to species whose fit- 
ness functions are additive, we might predict that evolution will have favoured 
species that used a representation of the genome that corresponds to a fitness 
function that has only weak interactions. And even if there are interactions, 
it seems plausible that the fitness would still involve a sum of such interacting 
terms, with the number of terms being some fraction of the genome size G.


Exercise 19.3.[3C] Investigate how fast sexual and asexual species evolve if
they have a fitness function with interactions. For example, let the fitness 
be a sum of exclusive-ors of pairs of bits; compare the evolving fitnesses 
with those of the sexual and asexual species with a simple additive fitness 
function. 
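As a starting point for this exercise, here is one possible pair of fitness functions. The pairing of bit 2k with bit 2k+1 is purely my assumption (the exercise does not say which pairs to use), and the function names are hypothetical.

```python
def additive_fitness(genotype):
    """The chapter's additive fitness: the number of 1-bits."""
    return sum(genotype)


def xor_pair_fitness(genotype):
    """A fitness with interactions: sum of exclusive-ors of disjoint adjacent pairs.

    Bit 2k is paired with bit 2k+1; a trailing unpaired bit is ignored.
    """
    return sum(genotype[i] ^ genotype[i + 1]
               for i in range(0, len(genotype) - 1, 2))


if __name__ == "__main__":
    z = [1, 0, 1, 1, 0, 1]
    print(additive_fitness(z), xor_pair_fitness(z))
```

Either function can be dropped into a simulation of selection in place of the additive fitness, so that the two cases are compared under otherwise identical conditions.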


Furthermore, if the fitness function were a highly nonlinear function of the 
genotype, it could be made more smooth and locally linear by the Baldwin 
effect. The Baldwin effect (Baldwin, 1896; Hinton and Nowlan, 1987) has been 
widely studied as a mechanism whereby learning guides evolution, and it could 
also act at the level of transcription and translation. Consider the evolution of 
a peptide sequence for a new purpose. Assume the effectiveness of the peptide 
is a highly nonlinear function of the sequence, perhaps having a small island 
of good sequences surrounded by an ocean of equally bad sequences. In an 
organism whose transcription and translation machinery is flawless, the fitness 
will be an equally nonlinear function of the DNA sequence, and evolution 
will wander around the ocean making progress towards the island only by a 
random walk. In contrast, an organism having the same DNA sequence, but 
whose DNA-to-RNA transcription or RNA-to-protein translation is ‘faulty’, 
will occasionally, by mistranslation or mistranscription, accidentally produce a 
working enzyme; and it will do so with greater probability if its DNA sequence 
is close to a good sequence. One cell might produce 1000 proteins from the 
one mRNA sequence, of which 999 have no enzymatic effect, and one does. 
The one working catalyst will be enough for that cell to have an increased 
fitness relative to rivals whose DNA sequence is further from the island of 
good sequences. For this reason I conjecture that, at least early in evolution, 
and perhaps still now, the genetic code was not implemented perfectly but was 
implemented noisily, with some codons coding for a distribution of possible 
amino acids. This noisy code could even be switched on and off from cell 
to cell in an organism by having multiple aminoacyl-tRNA synthetases, some 
more reliable than others. 

Whilst our model assumed that the bits of the genome do not interact, 
ignored the fact that the information is represented redundantly, assumed 
that there is a direct relationship between phenotypic fitness and the genotype, 
and assumed that the crossover probability in recombination is high, I believe 
these qualitative results would still hold if more complex models of fitness and 
crossover were used: the relative benefit of sex will still scale as $\sqrt{G}$. Only in
small, in-bred populations are the benefits of sex expected to be diminished. 

In summary: Why have sex? Because sex is good for your bits! 


Further reading 


How did a high-information-content self-replicating system ever emerge in the
first place? In the general area of the origins of life and other tricky
questions about evolution, I highly recommend Maynard Smith and Szathmáry
(1995), Maynard Smith and Szathmáry (1999), Kondrashov (1988), Maynard Smith
(1988), Ridley (2000), Dyson (1985), Cairns-Smith (1985), and Hopfield (1978).


> 19.6 Further exercises 


Exercise 19.4.[3] How good must the error-correcting machinery in DNA
replication be, given that mammals have not all died out long ago? Estimate
the probability of nucleotide substitution, per cell division. [See Appendix
C.4.]

Exercise 19.5.[4] Given that DNA replication is achieved by bumbling Brownian
motion and ordinary thermodynamics in a biochemical porridge at a temperature
of 35 C, it’s astonishing that the error-rate of DNA replication is about
$10^{-9}$ per replicated nucleotide. How can this reliability be achieved,
given that the energetic difference between a correct base-pairing and an
incorrect one is only one or two hydrogen bonds and the thermal energy $kT$ is
only about a factor of four smaller than the free energy associated with a
hydrogen bond? If ordinary thermodynamics is what favours correct
base-pairing, surely the frequency of incorrect base-pairing should be about

$$ f = \exp(-\Delta E/kT), \tag{19.21} $$

where $\Delta E$ is the free energy difference, i.e., an error frequency of
$f \simeq 10^{-4}$? How has DNA replication cheated thermodynamics?
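The arithmetic behind the estimate in equation (19.21) is short enough to spell out. The sketch assumes, as the exercise states, that each hydrogen bond is worth about four times kT; the function name is hypothetical.

```python
import math


def error_frequency(n_hydrogen_bonds, bond_energy_in_kT=4.0):
    """f = exp(-Delta E / kT), with Delta E equal to n bonds of bond_energy_in_kT each."""
    return math.exp(-n_hydrogen_bonds * bond_energy_in_kT)


if __name__ == "__main__":
    for n in (1, 2):
        print(n, "hydrogen bond(s):", error_frequency(n))
```

Two hydrogen bonds give exp(-8), a few times 10^-4, which is many orders of magnitude larger than the observed replication error rate.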

The situation is equally perplexing in the case of protein synthesis, which
translates an mRNA sequence into a polypeptide in accordance with the genetic
code. Two specific chemical reactions are protected against errors: the
binding of tRNA molecules to amino acids, and the production of the
polypeptide in the ribosome, which, like DNA replication, involves
base-pairing. Again, the fidelity is high (an error rate of about $10^{-4}$),
and this fidelity can’t be caused by the energy of the ‘correct’ final state
being especially low: the correct polypeptide sequence is not expected to be
significantly lower in energy than any other sequence. How do cells perform
error correction? (See Hopfield (1974), Hopfield (1980).)


Exercise 19.6.[2] While the genome acquires information through natural
selection at a rate of a few bits per generation, your brain acquires
information at a greater rate.

Estimate at what rate new information can be stored in long term memory 
by your brain. Think of learning the words of a new language, for example. 


> 19.7 Solutions 


Solution to exercise 19.1 (p.275). For small enough N, whilst the average fit- 
ness of the population increases, some unlucky bits become frozen into the 
bad state. (These bad genes are sometimes known as hitchhikers.) The ho- 
mogeneity assumption breaks down. Eventually, all individuals have identical 
genotypes that are mainly 1-bits, but contain some 0-bits too. The smaller 
the population, the greater the number of frozen 0-bits is expected to be. How 
small can the population size N be if the theory of sex is accurate? 

We find experimentally that the theory based on assuming homogeneity fits
poorly only if the population size $N$ is smaller than $\sim\sqrt{G}$. If $N$
is significantly smaller than $\sqrt{G}$, information cannot possibly be
acquired at a rate as big as $\sqrt{G}$, since the information content of the
Blind Watchmaker’s decisions cannot be any greater than $2N$ bits per
generation, this being the number of bits required to specify which of the
$2N$ children get to reproduce. Baum et al. (1995), analyzing a similar model,
show that the population size $N$ should be about $\sqrt{G}(\log G)^2$ to make
hitchhikers unlikely to arise.


Part IV


Probabilities and Inference


About Part IV


The number of inference problems that can (and perhaps should) be tackled 
by Bayesian inference methods is enormous. In this book, for example, we 
discuss the decoding problem for error-correcting codes, the task of inferring 
clusters from data, the task of interpolation through noisy data, and the task 
of classifying patterns given labelled examples. Most techniques for solving 
these problems can be categorized as follows. 


Exact methods compute the required quantities directly. Only a few inter- 
esting problems have a direct solution, but exact methods are important 
as tools for solving subtasks within larger problems. Methods for the 
exact solution of inference problems are the subject of Chapters 21, 24, 
25, and 26. 


Approximate methods can be subdivided into 


1. deterministic approximations, which include maximum likeli- 
hood (Chapter 22), Laplace’s method (Chapters 27 and 28) and 
variational methods (Chapter 33); and 


2. Monte Carlo methods — techniques in which random numbers 
play an integral part — which will be discussed in Chapters 29, 30, 
and 32. 


This part of the book does not form a one-dimensional story. Rather, the 
ideas make up a web of interrelated threads which will recombine in subsequent 
chapters. 

Chapter 3, which is an honorary member of this part, discussed a range of 
simple examples of inference problems and their Bayesian solutions. 

To give further motivation for the toolbox of inference methods discussed in 
this part, Chapter 20 discusses the problem of clustering; subsequent chapters 
discuss the probabilistic interpretation of clustering as mixture modelling. 

Chapter 21 discusses the option of dealing with probability distributions 
by completely enumerating all hypotheses. Chapter 22 introduces the idea 
of maximization methods as a way of avoiding the large cost associated with 
complete enumeration, and points out reasons why maximum likelihood is 
not good enough. Chapter 23 reviews the probability distributions that arise 
most often in Bayesian inference. Chapters 24, 25, and 26 discuss another 
way of avoiding the cost of complete enumeration: marginalization. Chapter 
25 discusses message-passing methods appropriate for graphical models, using 
the decoding of error-correcting codes as an example. Chapter 26 combines 
these ideas with message-passing concepts from Chapters 16 and 17. These 
chapters are a prerequisite for the understanding of advanced error-correcting 
codes. 

Chapter 27 discusses deterministic approximations including Laplace’s
method. This chapter is a prerequisite for understanding the topic of
complexity control in learning algorithms, an idea that is discussed in
general terms in Chapter 28.

Chapter 29 discusses Monte Carlo methods. Chapter 30 gives details of
state-of-the-art Monte Carlo techniques.

Chapter 31 introduces the Ising model as a test-bed for probabilistic meth- 
ods. An exact message-passing method and a Monte Carlo method are demon- 
strated. A motivation for studying the Ising model is that it is intimately 
related to several neural network models. Chapter 32 describes ‘exact’ Monte 
Carlo methods and demonstrates their application to the Ising model. 

Chapter 33 discusses variational methods and their application to Ising 
models and to simple statistical inference problems including clustering. This 
chapter will help the reader understand the Hopfield network (Chapter 42) 
and the EM algorithm, which is an important method in latent-variable mod- 
elling. Chapter 34 discusses a particularly simple latent variable model called 
independent component analysis. 

Chapter 35 discusses a ragbag of assorted inference topics. Chapter 36 
discusses a simple example of decision theory. Chapter 37 discusses differences 
between sampling theory and Bayesian methods. 


A theme: what inference is about 


A widespread misconception is that the aim of inference is to find the most 
probable explanation for some data. While this most probable hypothesis may 
be of interest, and some inference methods do locate it, this hypothesis is just 
the peak of a probability distribution, and it is the whole distribution that is 
of interest. As we saw in Chapter 4, the most probable outcome from a source 
is often not a typical outcome from that source. Similarly, the most probable 
hypothesis given some data may be atypical of the whole set of reasonably- 
plausible hypotheses. 
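A toy calculation illustrates the gap between "most probable" and "typical". The bent coin below (N = 100 flips, probability 0.1 of a one) is my own illustrative choice, and the function names are hypothetical.

```python
import math


def log2_prob_of_sequence(n_ones, N, p1):
    """log2 probability of one particular binary string containing n_ones ones."""
    return n_ones * math.log2(p1) + (N - n_ones) * math.log2(1.0 - p1)


def log2_prob_of_count(n_ones, N, p1):
    """log2 probability that a random string contains exactly n_ones ones."""
    return math.log2(math.comb(N, n_ones)) + log2_prob_of_sequence(n_ones, N, p1)


if __name__ == "__main__":
    N, p1 = 100, 0.1
    # The single most probable string is all zeros ...
    print("all-zeros string:", 2 ** log2_prob_of_sequence(0, N, p1))
    # ... but strings with about N*p1 = 10 ones carry most of the probability.
    print("P(exactly 0 ones): ", 2 ** log2_prob_of_count(0, N, p1))
    print("P(exactly 10 ones):", 2 ** log2_prob_of_count(10, N, p1))
```

The all-zeros string is the peak of the distribution over strings, yet a randomly drawn string almost never looks like it.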


About Chapter 20 


Before reading the next chapter, exercise 2.17 (p.36) and section 11.2 (inferring
the input to a Gaussian channel) are recommended reading.


20


An Example Inference Task: Clustering


Human brains are good at finding regularities in data. One way of expressing 
regularity is to put a set of objects into groups that are similar to each other. 
For example, biologists have found that most objects in the natural world 
fall into one of two categories: things that are brown and run away, and 
things that are green and don’t run away. The first group they call animals, 
and the second, plants. We’ll call this operation of grouping things together 
clustering. If the biologist further sub-divides the cluster of plants into sub- 
clusters, we would call this ‘hierarchical clustering’; but we won’t be talking 
about hierarchical clustering yet. In this chapter we'll just discuss ways to 
take a set of N objects and group them into K clusters. 


There are several motivations for clustering. First, a good clustering has 
predictive power. When an early biologist encounters a new green thing he has 
not seen before, his internal model of plants and animals fills in predictions for 
attributes of the green thing: it’s unlikely to jump on him and eat him; if he 
touches it, he might get grazed or stung; if he eats it, he might feel sick. All of 
these predictions, while uncertain, are useful, because they help the biologist 
invest his resources (for example, the time spent watching for predators) well. 
Thus, we perform clustering because we believe the underlying cluster labels 
are meaningful, will lead to a more efficient description of our data, and will 
help us choose better actions. This type of clustering is sometimes called 
‘mixture density modelling’, and the objective function that measures how 
well the predictive model is working is the information content of the data, 
log 1/P({x}). 

Second, clusters can be a useful aid to communication because they allow 
lossy compression. The biologist can give directions to a friend such as ‘go to 
the third tree on the right then take a right turn’ (rather than ‘go past the 
large green thing with red berries, then past the large green thing with thorns, 
then ...’). The brief category name ‘tree’ is helpful because it is sufficient to 
identify an object. Similarly, in lossy image compression, the aim is to convey 
in as few bits as possible a reasonable reproduction of a picture; one way to do 
this is to divide the image into N small patches, and find a close match to each 
patch in an alphabet of K image-templates; then we send a close fit to the 
image by sending the list of labels $k_1, k_2, \ldots, k_N$ of the matching templates.
The task of creating a good library of image-templates is equivalent to finding 
a set of cluster centres. This type of clustering is sometimes called ‘vector 
quantization’. 


We can formalize a vector quantizer in terms of an assignment rule
$\mathbf{x} \to k(\mathbf{x})$ for assigning datapoints $\mathbf{x}$ to one of
$K$ codenames, and a reconstruction rule $k \to \mathbf{m}^{(k)}$, the aim
being to choose the functions $k(\mathbf{x})$ and $\mathbf{m}^{(k)}$ so as to
minimize the expected distortion, which might be defined to be

$$ D = \sum_{\mathbf{x}} P(\mathbf{x})\,\frac{1}{2}\left[\mathbf{m}^{(k(\mathbf{x}))} - \mathbf{x}\right]^2. \tag{20.1} $$


[The ideal objective function would be to minimize the psychologically per- 
ceived distortion of the image. Since it is hard to quantify the distortion 
perceived by a human, vector quantization and lossy compression are not so 
crisply defined problems as data modelling and lossless compression.] In vec- 
tor quantization, we don’t necessarily believe that the templates {m} have 
any natural meaning; they are simply tools to do a job. We note in passing 
the similarity of the assignment rule (i.e., the encoder) of vector quantization 
to the decoding problem when decoding an error-correcting code. 

A third reason for making a cluster model is that failures of the cluster 
model may highlight interesting objects that deserve special attention. If 
we have trained a vector quantizer to do a good job of compressing satellite 
pictures of ocean surfaces, then maybe patches of image that are not well 
compressed by the vector quantizer are the patches that contain ships! If the 
biologist encounters a green thing and sees it run (or slither) away, this misfit
with his cluster model (which says green things don’t run away) cues him
to pay special attention. One can’t spend all one’s time being fascinated by
things; the cluster model can help sift out from the multitude of objects in
one’s world the ones that really deserve attention.

A fourth reason for liking clustering algorithms is that they may serve
as models of learning processes in neural systems. The clustering algorithm
that we now discuss, the K-means algorithm, is an example of a competitive
learning algorithm. The algorithm works by having the K clusters compete
with each other for the right to own the data points.

Figure 20.1. N = 40 data points.


> 20.1 K-means clustering 


The K-means algorithm is an algorithm for putting N data points in an
I-dimensional space into K clusters. Each cluster is parameterized by a vector
$m^{(k)}$ called its mean.

The data points will be denoted by $\{x^{(n)}\}$ where the superscript n runs
from 1 to the number of data points N. Each vector x has I components $x_i$.
We will assume that the space that x lives in is a real space and that we have
a metric that defines distances between points, for example,

$$ d(x, y) = \frac{1}{2} \sum_i (x_i - y_i)^2 . \tag{20.2} $$

[Margin note: About the name... As far as I know, the ‘K’ in K-means
clustering simply refers to the chosen number of clusters. If Newton had
followed the same naming policy, maybe we would learn at school about
‘calculus for the variable x’. It’s a silly name, but we are stuck with it.]


To start the K-means algorithm (algorithm 20.2), the K means $\{m^{(k)}\}$
are initialized in some way, for example to random values. K-means is then
an iterative two-step algorithm. In the assignment step, each data point n is 
assigned to the nearest mean. In the update step, the means are adjusted to 
match the sample means of the data points that they are responsible for. 

The K-means algorithm is demonstrated for a toy two-dimensional data set 
in figure 20.3, where 2 means are used. The assignments of the points to the 
two clusters are indicated by two point styles, and the two means are shown 
by the circles. The algorithm converges after three iterations, at which point 
the assignments are unchanged so the means remain unmoved when updated. 
The K-means algorithm always converges to a fixed point. 
Algorithm 20.2. The K-means clustering algorithm.

Initialization. Set K means $\{m^{(k)}\}$ to random values.

Assignment step. Each data point n is assigned to the nearest mean.
   We denote our guess for the cluster $k^{(n)}$ that the point $x^{(n)}$ belongs
   to by $\hat{k}^{(n)}$.

$$ \hat{k}^{(n)} = \operatorname{argmin}_k \{ d(m^{(k)}, x^{(n)}) \} . \tag{20.3} $$

   An alternative, equivalent representation of this assignment of
   points to clusters is given by ‘responsibilities’, which are indicator
   variables $r_k^{(n)}$. In the assignment step, we set $r_k^{(n)}$ to one if mean k
   is the closest mean to datapoint $x^{(n)}$; otherwise $r_k^{(n)}$ is zero.

$$ r_k^{(n)} = \begin{cases} 1 & \text{if } \hat{k}^{(n)} = k \\ 0 & \text{if } \hat{k}^{(n)} \neq k . \end{cases} \tag{20.4} $$

   What about ties? We don’t expect two means to be exactly the
   same distance from a data point, but if a tie does happen, $\hat{k}^{(n)}$ is
   set to the smallest of the winning $\{k\}$.

Update step. The model parameters, the means, are adjusted to match
   the sample means of the data points that they are responsible for.

$$ m^{(k)} = \frac{\sum_n r_k^{(n)} x^{(n)}}{R^{(k)}} \tag{20.5} $$

   where $R^{(k)}$ is the total responsibility of mean k,

$$ R^{(k)} = \sum_n r_k^{(n)} . \tag{20.6} $$

   What about means with no responsibilities? If $R^{(k)} = 0$, then we
   leave the mean $m^{(k)}$ where it is.

Repeat the assignment step and update step until the assignments
   do not change.
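The two steps of algorithm 20.2 can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: NumPy is used for the array arithmetic, the function names `distance` and `kmeans`, the toy two-blob data set and the random seed are all hypothetical, and the means are initialized on two of the data points rather than at random values.

```python
import numpy as np

def distance(x, m):
    """d(x, m) = 1/2 * sum_i (x_i - m_i)^2, the distance of equation (20.2)."""
    return 0.5 * np.sum((x - m) ** 2, axis=-1)

def kmeans(X, means, max_iter=100):
    """Hard K-means. X has shape (N, I); means has shape (K, I)."""
    means = np.array(means, dtype=float)
    assign = None
    for _ in range(max_iter):
        # Assignment step: nearest mean for every point. np.argmin returns
        # the smallest index among ties, which matches the tie rule above.
        d = distance(X[:, None, :], means[None, :, :])   # shape (N, K)
        new_assign = np.argmin(d, axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                        # assignments unchanged
        assign = new_assign
        # Update step: move each mean to the sample mean of its points;
        # a mean with no points (R_k = 0) is left where it is.
        for k in range(len(means)):
            members = X[assign == k]
            if len(members) > 0:
                means[k] = members.mean(axis=0)
    return means, assign

# Hypothetical toy data: two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])
means, assign = kmeans(X, means=X[[0, 20]])
```

With blobs this far apart the assignments settle almost immediately; the more interesting behaviour, such as dependence on the initial condition, shows up when the clusters overlap or K is larger.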
Figure 20.3. K-means algorithm

Exercise 20.1. [4, p.291] See if you can prove that K-means always converges.
   [Hint: find a physical analogy and an associated Lyapunov function.]


[A Lyapunov function is a function of the state of the algorithm that 
decreases whenever the state changes and that is bounded below. If a 
system has a Lyapunov function then its dynamics converge.] 


The K-means algorithm with a larger number of means, 4, is demonstrated in 
figure 20.4. The outcome of the algorithm depends on the initial condition. 
In the first case, after five iterations, a steady state is found in which the data 
points are fairly evenly split between the four clusters. In the second case, 
after six iterations, half the data points are in one cluster, and the others are 
shared among the other three clusters. 


Questions about this algorithm 


The K-means algorithm has several ad hoc features. Why does the update step 
set the ‘mean’ to the mean of the assigned points? Where did the distance d 
come from? What if we used a different measure of distance between x and m? 
How can we choose the ‘best’ distance? [In vector quantization, the distance 
function is provided as part of the problem definition; but I’m assuming we
are interested in data-modelling rather than vector quantization.] How do we 
choose K? Having found multiple alternative clusterings for a given K, how 
can we choose among them? 


Cases where K-means might be viewed as failing. 


Further questions arise when we look for cases where the algorithm behaves 
badly (compared with what the man in the street would call ‘clustering’). 
Figure 20.5a shows a set of 75 data points generated from a mixture of two 
Gaussians. The right-hand Gaussian has less weight (only one fifth of the data 
points), and it is a less broad cluster. Figure 20.5b shows the outcome of using 
K-means clustering with K = 2 means. Four of the big cluster’s data points 
have been assigned to the small cluster, and both means end up displaced 
to the left of the true centres of the clusters. The K-means algorithm takes 
account only of the distance between the means and the data points; it has 
no representation of the weight or breadth of each cluster. Consequently, data 
points that actually belong to the broad cluster are incorrectly assigned to the 
narrow cluster. 

Figure 20.6 shows another case of K-means behaving badly. The data 
evidently fall into two elongated clusters. But the only stable state of the 
K-means algorithm is that shown in figure 20.6b: the two clusters have been 
sliced in half! These two examples show that there is something wrong with 
the distance d in the K-means algorithm. The K-means algorithm has no way 
of representing the size or shape of a cluster. 

A final criticism of K-means is that it is a ‘hard’ rather than a ‘soft’
algorithm: points are assigned to exactly one cluster and all points assigned 
to a cluster are equals in that cluster. Points located near the border between 
two or more clusters should, arguably, play a partial role in determining the 
locations of all the clusters that they could plausibly be assigned to. But in 
the K-means algorithm, each borderline point is dumped in one cluster, and
has an equal vote with all the other points in that cluster, and no vote in any
other clusters. 


20.2 Soft K-means clustering 


These criticisms of K-means motivate the ‘soft K-means algorithm’,
algorithm 20.7. The algorithm has one parameter, $\beta$, which we could term the
stiffness.

Assignment step. Each data point $x^{(n)}$ is given a soft ‘degree of
   assignment’ to each of the means. We call the degree to which $x^{(n)}$
   is assigned to cluster k the responsibility $r_k^{(n)}$ (the responsibility of
   cluster k for point n).

$$ r_k^{(n)} = \frac{\exp\left(-\beta \, d(m^{(k)}, x^{(n)})\right)}{\sum_{k'} \exp\left(-\beta \, d(m^{(k')}, x^{(n)})\right)} . \tag{20.7} $$

   The sum of the K responsibilities for the nth point is 1.

Update step. The model parameters, the means, are adjusted to match
   the sample means of the data points that they are responsible for.

$$ m^{(k)} = \frac{\sum_n r_k^{(n)} x^{(n)}}{R^{(k)}} \tag{20.8} $$

   where $R^{(k)}$ is the total responsibility of mean k,

$$ R^{(k)} = \sum_n r_k^{(n)} . \tag{20.9} $$

Notice the similarity of this soft K-means algorithm to the hard K-means
algorithm 20.2. The update step is identical; the only difference is that the
responsibilities $r_k^{(n)}$ can take on values between 0 and 1. Whereas the
assignment $\hat{k}^{(n)}$ in the K-means algorithm involved a ‘min’ over the distances, the
rule for assigning the responsibilities is a ‘soft-min’ (20.7).
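To make the ‘soft-min’ concrete, here is a minimal sketch of the soft K-means iteration in Python with NumPy. The function name `soft_kmeans`, the toy data and the choice of $\beta$ are my own assumptions. The sketch also assumes every total responsibility $R^{(k)}$ stays positive, which holds for moderate $\beta$ but could fail through numerical underflow if $\beta$ is made very large.

```python
import numpy as np

def soft_kmeans(X, means, beta, n_iter=50):
    """Soft K-means with stiffness beta. X: (N, I); means: (K, I)."""
    means = np.array(means, dtype=float)
    for _ in range(n_iter):
        # d[n, k] = 1/2 * |x^(n) - m^(k)|^2
        d = 0.5 * np.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=2)
        # Soft-min (20.7). Subtracting each row's minimum distance before
        # exponentiating leaves the ratios unchanged but avoids underflow.
        r = np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
        r /= r.sum(axis=1, keepdims=True)
        # Update step: responsibility-weighted sample means.
        R = r.sum(axis=0)                 # total responsibility of each mean
        means = (r.T @ X) / R[:, None]    # assumes every R_k > 0
    return means, r

# Hypothetical toy data: two blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])

# Small stiffness (large lengthscale): both means collapse onto the data mean.
means_soft, r = soft_kmeans(X, X[[0, 20]], beta=0.01)
```

Rerunning with a larger stiffness, say `beta=2.0`, should pull the two means apart towards the two blobs, which is the bifurcation behaviour described for the lengthscale.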


Exercise 20.2. [2] Show that as the stiffness $\beta$ goes to $\infty$, the soft K-means
   algorithm becomes identical to the original hard K-means algorithm, except
   for the way in which means with no assigned points behave. Describe
   what those means do instead of sitting still.


Dimensionally, the stiffness $\beta$ is an inverse-length-squared, so we can
associate a lengthscale, $\sigma \equiv 1/\sqrt{\beta}$, with it. The soft K-means algorithm is
demonstrated in figure 20.8. The lengthscale is shown by the radius of the
circles surrounding the four means. Each panel shows the final fixed point
reached for a different value of the lengthscale $\sigma$.


20.3 Conclusion 


At this point, we may have fixed some of the problems with the original
K-means algorithm by introducing an extra complexity-control parameter $\beta$. But
how should we set $\beta$? And what about the problem of the elongated clusters,
and the clusters of unequal weight and width? Adding one stiffness parameter
$\beta$ is not going to make all these problems go away.

We’ll come back to these questions in a later chapter, as we develop the
mixture-density-modelling view of clustering.

Figure 20.8. Soft K-means algorithm, version 1, applied to a data set of
40 points. K = 4. Implicit lengthscale parameter $\sigma = 1/\beta^{1/2}$ varied from a
large to a small value. Each picture shows the state of all four means, with
the implicit lengthscale shown by the radius of the four circles, after
running the algorithm for several tens of iterations. At the largest
lengthscale, all four means converge exactly to the data mean. Then the four
means separate into two groups of two. At shorter lengthscales, each of
these pairs itself bifurcates into subgroups.


Further reading 


For a vector-quantization approach to clustering see (Luttrell, 1989; Luttrell, 
1990). 


> 20.4 Exercises


> Exercise 20.3. [3, p.291] Explore the properties of the soft K-means algorithm,
   version 1, assuming that the datapoints $\{x\}$ come from a single separable
   two-dimensional Gaussian distribution with mean zero and variances
   $(\mathrm{var}(x_1), \mathrm{var}(x_2)) = (\sigma_1^2, \sigma_2^2)$, with $\sigma_1^2 > \sigma_2^2$. Set K = 2, assume N is
   large, and investigate the fixed points of the algorithm as $\beta$ is varied.
   [Hint: assume that $m^{(1)} = (m, 0)$ and $m^{(2)} = (-m, 0)$.]


> Exercise 20.4. [3] Consider the soft K-means algorithm applied to a large
   amount of one-dimensional data that comes from a mixture of two
   equal-weight Gaussians with true means $\mu = \pm 1$ and standard deviation $\sigma_P$,
   for example $\sigma_P = 1$. Show that the hard K-means algorithm with K = 2
   leads to a solution in which the two means are further apart than the
   two true means. Discuss what happens for other values of $\beta$, and find
   the value of $\beta$ such that the soft algorithm puts the two means in the
   correct places.













> 20.5 Solutions 


Solution to exercise 20.1 (p.287). We can associate an ‘energy’ with the state
of the K-means algorithm by connecting a spring between each point $x^{(n)}$ and
the mean that is responsible for it. The energy of one spring is proportional to
its squared length, namely $\beta d(x^{(n)}, m^{(k)})$ where $\beta$ is the stiffness of the spring.
The total energy of all the springs is a Lyapunov function for the algorithm,
because (a) the assignment step can only decrease the energy, since a point only
changes its allegiance if the length of its spring would be reduced; (b) the
update step can only decrease the energy, since moving $m^{(k)}$ to the mean is the
way to minimize the energy of its springs; and (c) the energy is bounded below,
which is the second condition for a Lyapunov function. Since the algorithm
has a Lyapunov function, it converges.
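The spring-energy argument can be checked numerically. The sketch below is an illustration rather than a proof: it runs hard K-means on some hypothetical random data (my own choice of seed, N = 40 and K = 4), sets the spring stiffness to 1, and records the total energy after every assignment step and every update step. The recorded sequence should never increase.

```python
import numpy as np

def energy(X, means, assign):
    """Total spring energy: sum_n 1/2 |x^(n) - m^(k_n)|^2 (stiffness set to 1)."""
    return 0.5 * np.sum((X - means[assign]) ** 2)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))      # hypothetical data, N = 40
means = X[:4].copy()              # K = 4 means, initialized on data points
energies = []
assign = None
for _ in range(100):
    # Assignment step.
    d = 0.5 * np.sum((X[:, None, :] - means[None, :, :]) ** 2, axis=2)
    new_assign = np.argmin(d, axis=1)
    if assign is not None and np.array_equal(new_assign, assign):
        break
    assign = new_assign
    energies.append(energy(X, means, assign))   # energy after assignment
    # Update step (means with no points stay where they are).
    for k in range(len(means)):
        if np.any(assign == k):
            means[k] = X[assign == k].mean(axis=0)
    energies.append(energy(X, means, assign))   # energy after update

# True if no step ever raised the energy (up to rounding error).
monotone = all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```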


Solution to exercise 20.3 (p.290). If the means are initialized to $m^{(1)} = (m, 0)$
and $m^{(2)} = (-m, 0)$, the assignment step for a point at location $x_1, x_2$ gives

$$ r_1(x) = \frac{\exp(-\beta (x_1 - m)^2/2)}{\exp(-\beta (x_1 - m)^2/2) + \exp(-\beta (x_1 + m)^2/2)} \tag{20.10} $$

$$ = \frac{1}{1 + \exp(-2 \beta m x_1)} , \tag{20.11} $$

and the updated m is

$$ m' = \frac{\int dx_1 \, P(x_1) \, x_1 \, r_1(x)}{\int dx_1 \, P(x_1) \, r_1(x)} \tag{20.12} $$

$$ = 2 \int dx_1 \, P(x_1) \, x_1 \, \frac{1}{1 + \exp(-2 \beta m x_1)} . \tag{20.13} $$

Now, m = 0 is a fixed point, but the question is, is it stable or unstable? For
tiny m (that is, $\beta \sigma_1 m \ll 1$), we can Taylor-expand

$$ \frac{1}{1 + \exp(-2 \beta m x_1)} \simeq \frac{1}{2} (1 + \beta m x_1) + \cdots \tag{20.14} $$

so

$$ m' \simeq \int dx_1 \, P(x_1) \, x_1 (1 + \beta m x_1) \tag{20.15} $$

$$ = \sigma_1^2 \beta m . \tag{20.16} $$

For small m, m either grows or decays exponentially under this mapping,
depending on whether $\sigma_1^2 \beta$ is greater than or less than 1. The fixed point
m = 0 is stable if

$$ \sigma_1^2 \leq 1/\beta \tag{20.17} $$

and unstable otherwise. [Incidentally, this derivation shows that this result is
general, holding for any true probability distribution $P(x_1)$ having variance
$\sigma_1^2$, not just the Gaussian.]

If $\sigma_1^2 > 1/\beta$ then there is a bifurcation and there are two stable fixed points
surrounding the unstable fixed point at m = 0. To illustrate this bifurcation,
figure 20.10 shows the outcome of running the soft K-means algorithm with
$\beta = 1$ on one-dimensional data with standard deviation $\sigma_1$, for various values of
$\sigma_1$. Figure 20.11 shows this pitchfork bifurcation from the other point of view,
where the data’s standard deviation $\sigma_1$ is fixed and the algorithm’s lengthscale
$\sigma = 1/\beta^{1/2}$ is varied on the horizontal axis.
Figure 20.9. Schematic diagram of the bifurcation as the largest data
variance $\sigma_1$ increases from below $1/\beta^{1/2}$ to above $1/\beta^{1/2}$. The data
variance is indicated by the ellipse.

Figure 20.10. The stable mean locations as a function of $\sigma_1$, for
constant $\beta$, found numerically (thick lines), and the
approximation (20.22) (thin lines).

Figure 20.11. The stable mean locations as a function of $1/\beta^{1/2}$,
for constant $\sigma_1$.
Here is a cheap theory to model how the fitted parameters $\pm m$ behave beyond
the bifurcation, based on continuing the series expansion. This continuation
of the series is rather suspect, since the series isn’t necessarily expected to
converge beyond the bifurcation point, but the theory fits well anyway.

We take our analytic approach one term further in the expansion

$$ \frac{1}{1 + \exp(-2 \beta m x_1)} \simeq \frac{1}{2} \left( 1 + \beta m x_1 - \frac{1}{3} (\beta m x_1)^3 \right) + \cdots \tag{20.18} $$

then we can solve for the shape of the bifurcation to leading order, which
depends on the fourth moment of the distribution:

$$ m' \simeq \int dx_1 \, P(x_1) \, x_1 \left( 1 + \beta m x_1 - \frac{1}{3} (\beta m x_1)^3 \right) \tag{20.19} $$

$$ = \sigma_1^2 \beta m - \frac{1}{3} (\beta m)^3 \, 3 \sigma_1^4 . \tag{20.20} $$

[At (20.20) we use the fact that $P(x_1)$ is Gaussian to find the fourth moment.]
This map has a fixed point at m such that

$$ \sigma_1^2 \beta \left( 1 - (\beta m)^2 \sigma_1^2 \right) = 1 , \tag{20.21} $$

i.e.,

$$ m = \pm \beta^{-1/2} \, \frac{(\sigma_1^2 \beta - 1)^{1/2}}{\sigma_1^2 \beta} . \tag{20.22} $$

The thin line in figure 20.10 shows this theoretical approximation. Figure 20.10
shows the bifurcation as a function of $\sigma_1$ for fixed $\beta$; figure 20.11 shows the
bifurcation as a function of $1/\beta^{1/2}$ for fixed $\sigma_1$.
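The bifurcation can be explored numerically by iterating the one-dimensional update map for m directly. The sketch below assumes a Gaussian $P(x_1)$ and evaluates the integral by crude quadrature on a grid; the function names, grid size, starting value and iteration count are all my own arbitrary choices. It also evaluates the approximation (20.22), so the two can be compared just above the bifurcation.

```python
import numpy as np

def update_m(m, beta, sigma1, n_grid=4001):
    """One step of m' = 2 * integral of P(x) x / (1 + exp(-2 beta m x)),
    for Gaussian P with standard deviation sigma1, by grid quadrature."""
    x = np.linspace(-8 * sigma1, 8 * sigma1, n_grid)
    dx = x[1] - x[0]
    p = np.exp(-x**2 / (2 * sigma1**2)) / (np.sqrt(2 * np.pi) * sigma1)
    r1 = 1.0 / (1.0 + np.exp(-2 * beta * m * x))
    return 2.0 * np.sum(p * x * r1) * dx

def fixed_point(beta, sigma1, m0=0.1, n_iter=500):
    """Iterate the map from a small positive m0 and return where it ends up."""
    m = m0
    for _ in range(n_iter):
        m = update_m(m, beta, sigma1)
    return m

def approx_m(beta, sigma1):
    """The approximation (20.22); only meaningful when sigma1^2 * beta > 1."""
    s = sigma1**2 * beta
    return beta**-0.5 * np.sqrt(s - 1.0) / s

m_below = fixed_point(beta=1.0, sigma1=0.8)    # sigma1^2 * beta < 1
m_above = fixed_point(beta=1.0, sigma1=1.1)    # sigma1^2 * beta > 1
m_theory = approx_m(beta=1.0, sigma1=1.1)
```

Below the bifurcation the iteration should decay to m = 0; above it, m should settle at a nonzero value of the same order as `m_theory`. Sweeping `sigma1` over a range of values gives a curve in the spirit of figure 20.10.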


> Exercise 20.5. [2, p.292] Why does the pitchfork in figure 20.11 tend to the
   values $\sim \pm 0.8$ as $1/\beta^{1/2} \to 0$? Give an analytic expression for this
   asymptote.





Solution to exercise 20.5 (p.292). The asymptote is the mean of the rectified
Gaussian,

$$ \frac{\int_0^\infty \mathrm{Normal}(x, 1) \, x \, dx}{1/2} = \sqrt{2/\pi} \simeq 0.798 . \tag{20.23} $$
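For readers who want the intermediate steps, the integral can be done in closed form. This is a worked version of the result above, assuming (as the solution does) unit data variance, with the factor 1/2 in the denominator being the probability mass on the positive half-line:

```latex
\frac{\int_0^\infty \mathrm{Normal}(x, 1)\, x \, dx}{1/2}
  = 2 \int_0^\infty \frac{x \, e^{-x^2/2}}{\sqrt{2\pi}} \, dx
  = \frac{2}{\sqrt{2\pi}} \Bigl[ -e^{-x^2/2} \Bigr]_0^\infty
  = \frac{2}{\sqrt{2\pi}}
  = \sqrt{2/\pi} \simeq 0.798 .
```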
Exact Inference by Complete Enumeration


We open our toolbox of methods for handling probabilities by discussing a 
brute-force inference method: complete enumeration of all hypotheses, and 
evaluation of their probabilities. This approach is an exact method, and the 
difficulty of carrying it out will motivate the smarter exact and approximate 
methods introduced in the following chapters. 


> 21.1 The burglar alarm 


Bayesian probability theory is sometimes called ‘common sense, amplified’. 
When thinking about the following questions, please ask your common sense 
what it thinks the answers are; we will then see how Bayesian methods confirm 
your everyday intuition. 


Example 21.1. Fred lives in Los Angeles and commutes 60 miles to work.
   Whilst at work, he receives a phone-call from his neighbour saying that
   Fred’s burglar alarm is ringing. What is the probability that there was
   a burglar in his house today? While driving home to investigate, Fred
   hears on the radio that there was a small earthquake that day near his
   home. ‘Oh’, he says, feeling relieved, ‘it was probably the earthquake
   that set off the alarm’. What is the probability that there was a burglar
   in his house? (After Pearl, 1988).

Figure 21.1. Belief network for the burglar alarm problem. [Nodes:
Earthquake, Burglar, Alarm, Radio, Phonecall.]


Let’s introduce variables b (a burglar was present in Fred’s house today), 
a (the alarm is ringing), p (Fred receives a phonecall from the neighbour re- 
porting the alarm), e (a small earthquake took place today near Fred’s house), 
and r (the radio report of earthquake is heard by Fred). The probability of 
all these variables might factorize as follows: 


$$ P(b, e, a, p, r) = P(b) P(e) P(a \mid b, e) P(p \mid a) P(r \mid e), \tag{21.1} $$

and plausible values for the probabilities are:

1. Burglar probability:

$$ P(b=1) = \beta, \quad P(b=0) = 1 - \beta, \tag{21.2} $$

   e.g., $\beta = 0.001$ gives a mean burglary rate of once every three years.

2. Earthquake probability:

$$ P(e=1) = \epsilon, \quad P(e=0) = 1 - \epsilon, \tag{21.3} $$

   with, e.g., $\epsilon = 0.001$; our assertion that the earthquakes are independent
   of burglars, i.e., the prior probability of b and e is P(b,e) = P(b)P(e),
   seems reasonable unless we take into account opportunistic burglars who
   strike immediately after earthquakes.


3. Alarm ringing probability: we assume the alarm will ring if any of the
   following three events happens: (a) a burglar enters the house, and
   triggers the alarm (let’s assume the alarm has a reliability of $\alpha_b = 0.99$, i.e.,
   99% of burglars trigger the alarm); (b) an earthquake takes place, and
   triggers the alarm (perhaps $\alpha_e = 1\%$ of alarms are triggered by
   earthquakes?); or (c) some other event causes a false alarm; let’s assume the
   false alarm rate f is 0.001, so Fred has false alarms from non-earthquake
   causes once every three years. [This type of dependence of a on b and e
   is known as a ‘noisy-or’.] The probabilities of a given b and e are then:





























































































































   P(a=0 | b=0, e=0) = $(1-f)$,                            P(a=1 | b=0, e=0) = $f$
   P(a=0 | b=1, e=0) = $(1-f)(1-\alpha_b)$,                P(a=1 | b=1, e=0) = $1 - (1-f)(1-\alpha_b)$
   P(a=0 | b=0, e=1) = $(1-f)(1-\alpha_e)$,                P(a=1 | b=0, e=1) = $1 - (1-f)(1-\alpha_e)$
   P(a=0 | b=1, e=1) = $(1-f)(1-\alpha_b)(1-\alpha_e)$,    P(a=1 | b=1, e=1) = $1 - (1-f)(1-\alpha_b)(1-\alpha_e)$

or, in numbers,

   P(a=0 | b=0, e=0) = 0.999,        P(a=1 | b=0, e=0) = 0.001
   P(a=0 | b=1, e=0) = 0.009 99,     P(a=1 | b=1, e=0) = 0.990 01
   P(a=0 | b=0, e=1) = 0.989 01,     P(a=1 | b=0, e=1) = 0.010 99
   P(a=0 | b=1, e=1) = 0.009 890 1,  P(a=1 | b=1, e=1) = 0.990 109 9.
We assume the neighbour would never phone if the alarm is not ringing
[P(p=1 | a=0) = 0]; and that the radio is a trustworthy reporter too
[P(r=1 | e=0) = 0]; we won’t need to specify the probabilities P(p=1 | a=1)
or P(r=1 | e=1) in order to answer the questions above, since the outcomes
p=1 and r=1 give us certainty respectively that a=1 and e=1.

We can answer the two questions about the burglar by computing the
posterior probabilities of all hypotheses given the available information. Let’s
start by reminding ourselves that the probability that there is a burglar, before
either p or r is observed, is $P(b=1) = \beta = 0.001$, and the probability that an
earthquake took place is $P(e=1) = \epsilon = 0.001$, and these two propositions are
independent.
First, when p=1, we know that the alarm is ringing: a=1. The posterior
probability of b and e becomes:

$$ P(b, e \mid a=1) = \frac{P(a=1 \mid b, e) P(b) P(e)}{P(a=1)} . \tag{21.4} $$

The numerator’s four possible values are

   P(a=1 | b=0, e=0) × P(b=0) × P(e=0) = 0.001 × 0.999 × 0.999 = 0.000998
   P(a=1 | b=1, e=0) × P(b=1) × P(e=0) = 0.99001 × 0.001 × 0.999 = 0.000989
   P(a=1 | b=0, e=1) × P(b=0) × P(e=1) = 0.01099 × 0.999 × 0.001 = 0.000010979
   P(a=1 | b=1, e=1) × P(b=1) × P(e=1) = 0.9901099 × 0.001 × 0.001 = $9.9 \times 10^{-7}$.

The normalizing constant is the sum of these four numbers, P(a=1) = 0.002,
and the posterior probabilities are

   P(b=0, e=0 | a=1) = 0.4993
   P(b=1, e=0 | a=1) = 0.4947
   P(b=0, e=1 | a=1) = 0.0055
   P(b=1, e=1 | a=1) = 0.0005.        (21.5)
To answer the question, ‘what’s the probability a burglar was there?’ we
marginalize over the earthquake variable e:

   P(b=0 | a=1) = P(b=0, e=0 | a=1) + P(b=0, e=1 | a=1) = 0.505
   P(b=1 | a=1) = P(b=1, e=0 | a=1) + P(b=1, e=1 | a=1) = 0.495.        (21.6)


So there is nearly a 50% chance that there was a burglar present. It is impor- 
tant to note that the variables b and e, which were independent a priori, are 
now dependent. The posterior distribution (21.5) is not a separable function of 
b and e. This fact is illustrated most simply by studying the effect of learning 
that e = 1. 

When we learn e=1, the posterior probability of b is given by
P(b | e=1, a=1) = P(b, e=1 | a=1)/P(e=1 | a=1), i.e., by dividing the
bottom two rows of (21.5) by their sum P(e=1 | a=1) = 0.0060. The posterior
probability of b is:

   P(b=0 | e=1, a=1) = 0.92
   P(b=1 | e=1, a=1) = 0.08.        (21.7)


There is thus now an 8% chance that a burglar was in Fred’s house. It is 
in accordance with everyday intuition that the probability that b=1 (a pos- 
sible cause of the alarm) reduces when Fred learns that an earthquake, an 
alternative explanation of the alarm, has happened. 
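The whole calculation is small enough to enumerate in a few lines of Python. This is a minimal sketch using the example values quoted above; the variable and function names are my own. It builds the noisy-or table, enumerates the four hypotheses (b, e), normalizes as in (21.4), and then marginalizes and conditions.

```python
from itertools import product

beta, eps = 0.001, 0.001                # P(b=1) and P(e=1)
alpha_b, alpha_e, f = 0.99, 0.01, 0.001

def p_alarm(b, e):
    """Noisy-or: P(a=1 | b, e)."""
    return 1.0 - (1 - f) * (1 - alpha_b) ** b * (1 - alpha_e) ** e

def prior(b, e):
    """P(b) P(e); b and e are independent a priori."""
    return (beta if b else 1 - beta) * (eps if e else 1 - eps)

# Enumerate all four hypotheses (b, e); the numerators of (21.4).
joint = {(b, e): p_alarm(b, e) * prior(b, e)
         for b, e in product((0, 1), repeat=2)}
p_a = sum(joint.values())                        # normalizing constant P(a=1)
posterior = {h: v / p_a for h, v in joint.items()}

# Marginalize over e to get P(b=1 | a=1).
p_b_given_a = posterior[(1, 0)] + posterior[(1, 1)]
# Condition on e=1 as well: divide by P(e=1 | a=1).
p_e_given_a = posterior[(0, 1)] + posterior[(1, 1)]
p_b_given_a_e = posterior[(1, 1)] / p_e_given_a
```

The drop from `p_b_given_a` to `p_b_given_a_e` is the explaining-away effect in numerical form.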


Explaining away 


This phenomenon, that one of the possible causes (b=1) of some data (the 
data in this case being a=1) becomes less probable when another of the causes 
(e=1) becomes more probable, even though those two causes were indepen- 
dent variables a priori, is known as explaining away. Explaining away is an 
important feature of correct inferences, and one that any artificial intelligence 
should replicate. 

If we believe that the neighbour and the radio service are unreliable or 
capricious, so that we are not certain that the alarm really is ringing or that 
an earthquake really has happened, the calculations become more complex, 
but the explaining-away effect persists; the arrival of the earthquake report r 
simultaneously makes it more probable that the alarm truly is ringing, and 
less probable that the burglar was present. 

In summary, we solved the inference questions about the burglar by enu- 
merating all four hypotheses about the variables (b, e), finding their posterior 
probabilities, and marginalizing to obtain the required inferences about b. 


> Exercise 21.2.!] After Fred receives the phone-call about the burglar alarm, 
but before he hears the radio report, what, from his point of view, is the 
probability that there was a small earthquake today? 


> 21.2 Exact inference for continuous hypothesis spaces


Many of the hypothesis spaces we will consider are naturally thought of as
continuous. For example, the unknown decay length $\lambda$ of section 3.1 (p.48)
lives in a continuous one-dimensional space; and the unknown mean and
standard deviation of a Gaussian $\mu$, $\sigma$ live in a continuous two-dimensional space.
In any practical computer implementation, such continuous spaces will
necessarily be discretized, however, and so can, in principle, be enumerated,
at a grid of parameter values, for example. In figure 3.2 we plotted the likelihood
function for the decay length as a function of $\lambda$ by evaluating the likelihood
at a finely-spaced series of points.

Figure 21.2. Enumeration of an entire (discretized) hypothesis space for one
Gaussian with parameters $\mu$ (horizontal axis) and $\sigma$ (vertical).


A two-parameter model 


Let’s look at the Gaussian distribution as an example of a model with a
two-dimensional hypothesis space. The one-dimensional Gaussian distribution is
parameterized by a mean $\mu$ and a standard deviation $\sigma$:

$$ P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \equiv \mathrm{Normal}(x; \mu, \sigma^2) . \tag{21.8} $$

Figure 21.2 shows an enumeration of one hundred hypotheses about the mean
and standard deviation of a one-dimensional Gaussian distribution. These
hypotheses are evenly spaced in a ten by ten square grid covering ten values
of $\mu$ and ten values of $\sigma$. Each hypothesis is represented by a picture showing
the probability density that it puts on x. We now examine the inference of $\mu$
and $\sigma$ given data points $x_n$, $n = 1, \ldots, N$, assumed to be drawn independently
from this density.

Figure 21.3. Five datapoints $\{x_n\}_{n=1}^5$. The horizontal coordinate is the
value of the datum, $x_n$; the vertical coordinate has no meaning.

Imagine that we acquire data, for example the five points shown in
figure 21.3. We can now evaluate the posterior probability of each of the one
hundred subhypotheses by evaluating the likelihood of each, that is, the value
of $P(\{x_n\}_{n=1}^5 \mid \mu, \sigma)$. The likelihood values are shown diagrammatically in
figure 21.4 using the line thickness to encode the value of the likelihood.
Subhypotheses with likelihood smaller than $e^{-8}$ times the maximum likelihood
have been deleted.

Using a finer grid, we can represent the same information by plotting the
likelihood as a surface plot or contour plot as a function of $\mu$ and $\sigma$ (figure 21.5).
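Enumerating the ten-by-ten grid takes only a few lines. In this sketch the five data points are hypothetical: the actual values are not listed in the text, so I have simply chosen five evenly spaced points whose mean is 1.0 and whose sum of squared deviations is 1.0, matching the summary statistics quoted for figure 21.5. The grid ranges are likewise my own assumption.

```python
import numpy as np

# Hypothetical data: five points with mean 1.0 and sum of squared deviations 1.0.
x = 1.0 + np.arange(-2, 3) / np.sqrt(10.0)

mus = np.linspace(0.0, 2.0, 10)       # ten values of mu (assumed range)
sigmas = np.linspace(0.1, 1.0, 10)    # ten values of sigma (assumed range)

def log_likelihood(x, mu, sigma):
    """ln P({x_n} | mu, sigma) for independent Gaussian data."""
    return (-len(x) * np.log(np.sqrt(2 * np.pi) * sigma)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

# Evaluate every subhypothesis on the grid; rows index sigma, columns index mu.
logL = np.array([[log_likelihood(x, mu, s) for mu in mus] for s in sigmas])

# Locate the most probable grid point, and mark the subhypotheses whose
# likelihood is at least e^{-8} times the maximum (the ones that survive).
i_sigma, i_mu = np.unravel_index(np.argmax(logL), logL.shape)
keep = logL >= logL.max() - 8.0
```

Working with log likelihoods means the $e^{-8}$ threshold becomes a simple subtraction of 8.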


A five-parameter mixture model 


Eyeballing the data (figure 21.3), you might agree that it seems more
plausible that they come not from a single Gaussian but from a mixture of two
Gaussians, defined by two means, two standard deviations, and two mixing
coefficients $\pi_1$ and $\pi_2$, satisfying $\pi_1 + \pi_2 = 1$, $\pi_i \geq 0$.

$$ P(x \mid \mu_1, \sigma_1, \pi_1, \mu_2, \sigma_2, \pi_2) = \frac{\pi_1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) + \frac{\pi_2}{\sqrt{2\pi}\,\sigma_2} \exp\left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right) $$

Figure 21.4. Likelihood function, given the data of figure 21.3,
represented by line thickness. Subhypotheses having likelihood
smaller than $e^{-8}$ times the maximum likelihood are not shown.

Figure 21.5. The likelihood function for the parameters of a
Gaussian distribution. Surface plot and contour plot of
the log likelihood as a function of $\mu$ and $\sigma$. The data set of N = 5
points had mean $\bar{x} = 1.0$ and $S = \sum (x - \bar{x})^2 = 1.0$.

Figure 21.6. Enumeration of the entire (discretized) hypothesis
space for a mixture of two Gaussians. Weight of the mixture
components is $\pi_1, \pi_2 = 0.6, 0.4$ in the top half and 0.8, 0.2 in the
bottom half. Means $\mu_1$ and $\mu_2$ vary horizontally, and standard
deviations $\sigma_1$ and $\sigma_2$ vary vertically.

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981

You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


Let’s enumerate the subhypotheses for this alternative model. The parameter
space is five-dimensional, so it becomes challenging to represent it on a single
page. Figure 21.6 enumerates 800 subhypotheses with different values of the
five parameters $\mu_1, \mu_2, \sigma_1, \sigma_2, \pi_1$. The means are varied between five values
each in the horizontal directions. The standard deviations take on four values
each vertically. And $\pi_1$ takes on two values vertically. We can represent the
inference about these five parameters in the light of the five datapoints as
shown in figure 21.7.

If we wish to compare the one-Gaussian model with the mixture-of-two
model, we can find the models’ posterior probabilities by evaluating the
marginal likelihood or evidence for each model $\mathcal{H}$, $P(\{x\} \mid \mathcal{H})$. The evidence
is given by integrating over the parameters, $\theta$; the integration can be
implemented numerically by summing over the alternative enumerated values of
$\theta$,

$$ P(\{x\} \mid \mathcal{H}) = \sum_{\theta} P(\theta) P(\{x\} \mid \theta, \mathcal{H}), \tag{21.9} $$

where $P(\theta)$ is the prior distribution over the grid of parameter values, which
I take to be uniform.

For the mixture of two Gaussians this integral is a five-dimensional integral;
if it is to be performed at all accurately, the grid of points will need to be
much finer than the grids shown in the figures. If the uncertainty about each
of K parameters has been reduced by, say, a factor of ten by observing the
data, then brute-force integration requires a grid of at least $10^K$ points. This
exponential growth of computation with model size is the reason why complete
enumeration is rarely a feasible computational strategy.
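The sum (21.9) is easy to write down for the one-Gaussian model, and the same sketch makes the exponential cost explicit. Everything specific here is an assumption of mine: the five data points are hypothetical (chosen to have mean 1.0 and S = 1.0), the grid ranges are arbitrary, and the function names are invented for illustration.

```python
import numpy as np

def log_evidence_on_grid(x, mus, sigmas):
    """ln P({x} | H) from (21.9), with a uniform prior over the grid points."""
    M, S = np.meshgrid(mus, sigmas)                       # shapes (n_sigma, n_mu)
    sq = ((x[:, None, None] - M[None]) ** 2).sum(axis=0)  # sum_n (x_n - mu)^2
    logL = -len(x) * np.log(np.sqrt(2 * np.pi) * S) - sq / (2 * S ** 2)
    log_prior = -np.log(logL.size)                        # uniform: 1 / (grid size)
    top = logL.max()                                      # log-sum-exp for stability
    return log_prior + top + np.log(np.exp(logL - top).sum())

def grid_points_needed(K, refinement=10):
    """Grid size if each of K parameters needs `refinement` points."""
    return refinement ** K

x = 1.0 + np.arange(-2, 3) / np.sqrt(10.0)   # hypothetical data
log_ev = log_evidence_on_grid(x, np.linspace(0.0, 2.0, 10),
                              np.linspace(0.1, 1.0, 10))
n_points_mixture = grid_points_needed(K=5)   # the five-parameter mixture
```

For the two-parameter Gaussian a 10 × 10 grid is trivial; for the five-parameter mixture the same resolution already needs a hundred thousand likelihood evaluations, and each extra parameter multiplies the cost by ten.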


Exercise 21.3.4] Imagine fitting a mixture of ten Gaussians to data in a 
   twenty-dimensional space. Estimate the computational cost of implementing
   inferences for this model by enumeration of a grid of parameter values.
Figure 21.7. Inferring a mixture of two Gaussians. Likelihood
function, given the data of figure 21.3, represented by line
thickness. The hypothesis space is identical to that shown in
figure 21.6. Subhypotheses having likelihood smaller than $e^{-8}$ times
the maximum likelihood are not shown, hence the blank regions,
which correspond to hypotheses that the data have ruled out.
Maximum Likelihood and Clustering


Rather than enumerate all hypotheses, which may be exponential in number,
we can save a lot of time by homing in on one good hypothesis that fits
the data well. This is the philosophy behind the maximum likelihood method,
which identifies the setting of the parameter vector $\theta$ that maximizes the
likelihood, $P(\mathrm{Data} \mid \theta, \mathcal{H})$.

For some models the maximum likelihood parameters can be identified 
instantly from the data; for more complex models, finding the maximum like- 
lihood parameters may require an iterative algorithm. 

For any model, it is usually easiest to work with the logarithm of the 
likelihood rather than the likelihood, since likelihoods, being products of the 
probabilities of many data points, tend to be very small. Likelihoods multiply; 
log likelihoods add. 


> 22.1 Maximum likelihood for one Gaussian 


We return to the Gaussian for our first examples. Assume we have data
$\{x_n\}_{n=1}^N$. The log likelihood is:

$$\ln P(\{x_n\}_{n=1}^N \mid \mu, \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \sum_n (x_n - \mu)^2 / (2\sigma^2). \qquad (22.1)$$

The likelihood can be expressed in terms of two functions of the data, the
sample mean

$$\bar{x} \equiv \sum_{n=1}^N x_n / N, \qquad (22.2)$$

and the sum of square deviations

$$S \equiv \sum_n (x_n - \bar{x})^2: \qquad (22.3)$$

$$\ln P(\{x_n\}_{n=1}^N \mid \mu, \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - [N(\mu - \bar{x})^2 + S] / (2\sigma^2). \qquad (22.4)$$

Because the likelihood depends on the data only through $\bar{x}$ and $S$, these two
quantities are known as sufficient statistics.
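
As a quick check of the sufficient-statistics claim, here is a small Python sketch that evaluates the log likelihood twice: once from the raw data, as in (22.1), and once from $N$, $\bar{x}$ and $S$ alone, as in (22.4). The five data values and the function names are made up for illustration.

```python
import math

def log_lik_direct(data, mu, sigma):
    """Equation (22.1): a sum over the individual data points."""
    n = len(data)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

def sufficient_stats(data):
    """Return (N, xbar, S) as defined in equations (22.2) and (22.3)."""
    n = len(data)
    xbar = sum(data) / n
    s = sum((x - xbar) ** 2 for x in data)
    return n, xbar, s

def log_lik_from_stats(n, xbar, s, mu, sigma):
    """Equation (22.4): the same quantity from N, xbar and S only."""
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - (n * (mu - xbar) ** 2 + s) / (2 * sigma ** 2))

data = [0.3, 1.1, 0.8, 1.6, 1.2]   # illustrative values only
n, xbar, s = sufficient_stats(data)
checks = []
for mu, sigma in [(0.0, 1.0), (1.0, 0.5), (2.5, 3.0)]:
    a = log_lik_direct(data, mu, sigma)
    b = log_lik_from_stats(n, xbar, s, mu, sigma)
    checks.append((a, b))
    print(mu, sigma, a, b)   # the last two columns agree up to rounding
```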


Example 22.1. Differentiate the log likelihood with respect to $\mu$ and show that,
if the standard deviation is known to be $\sigma$, the maximum likelihood mean
$\mu$ of a Gaussian is equal to the sample mean $\bar{x}$, for any value of $\sigma$.

Solution.

$$\frac{\partial}{\partial \mu} \ln P = -\frac{N(\mu - \bar{x})}{\sigma^2} \qquad (22.5)$$

$$= 0 \text{ when } \mu = \bar{x}. \quad \Box \qquad (22.6)$$
Figure 22.1. The likelihood function for the parameters of a Gaussian
distribution. (a1, a2) Surface plot and contour plot of the log likelihood as a
function of $\mu$ and $\sigma$. The data set of $N = 5$ points had mean
$\bar{x} = 1.0$ and $S = \sum (x - \bar{x})^2 = 1.0$. (b) The posterior
probability of $\mu$ for various values of $\sigma$. (c) The posterior
probability of $\sigma$ for various fixed values of $\mu$ (shown as a density
over $\ln \sigma$).

If we Taylor-expand the log likelihood about the maximum, we can define
approximate error bars on the maximum likelihood parameter: we use
a quadratic approximation to estimate how far from the maximum-likelihood
parameter setting we can go before the likelihood falls by some standard
factor, for example $e^{1/2}$, or $e^{4/2}$. In the special case of a likelihood that is a
Gaussian function of the parameters, the quadratic approximation is exact.


Example 22.2. Find the second derivative of the log likelihood with respect to
$\mu$, and find the error bars on $\mu$, given the data and $\sigma$.

Solution.

$$\frac{\partial^2}{\partial \mu^2} \ln P = -\frac{N}{\sigma^2}. \qquad (22.7)$$

Comparing this curvature with the curvature of the log of a Gaussian
distribution over $\mu$ of standard deviation $\sigma_\mu$,
$\exp(-\mu^2/(2\sigma_\mu^2))$, which is $-1/\sigma_\mu^2$, we
can deduce that the error bars on $\mu$ (derived from the likelihood function) are

$$\sigma_\mu = \frac{\sigma}{\sqrt{N}}. \qquad (22.8)$$

The error bars have this property: at the two points $\mu = \bar{x} \pm \sigma_\mu$, the likelihood
is smaller than its maximum value by a factor of $e^{1/2}$.
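
The error-bar property can be verified in a few lines. This sketch uses the summary statistics quoted for figure 22.1 ($N = 5$, $\bar{x} = 1.0$, $S = 1.0$); the value $\sigma = 0.6$ is an arbitrary assumption chosen only for illustration.

```python
import math

def log_lik(n, xbar, s, mu, sigma):
    """Log likelihood of a Gaussian written with the sufficient statistics (22.4)."""
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - (n * (mu - xbar) ** 2 + s) / (2 * sigma ** 2))

n, xbar, s = 5, 1.0, 1.0        # summary statistics quoted for figure 22.1
sigma = 0.6                     # assumed known standard deviation (illustrative)
sigma_mu = sigma / math.sqrt(n) # error bar on mu, equation (22.8)

at_peak = log_lik(n, xbar, s, xbar, sigma)
drop_up = at_peak - log_lik(n, xbar, s, xbar + sigma_mu, sigma)
drop_down = at_peak - log_lik(n, xbar, s, xbar - sigma_mu, sigma)
print(sigma_mu, drop_up, drop_down)   # both drops are 1/2 up to rounding
```

A drop of 1/2 in the log likelihood is the factor of $e^{1/2}$ in the likelihood itself.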





Example 22.3. Find the maximum likelihood standard deviation $\sigma$ of a
Gaussian, whose mean is known to be $\mu$, in the light of data
$\{x_n\}_{n=1}^N$. Find the second derivative of the log likelihood with
respect to $\ln \sigma$, and error bars on $\ln \sigma$.

Solution. The likelihood's dependence on $\sigma$ is

$$\ln P(\{x_n\}_{n=1}^N \mid \mu, \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \frac{S_{\rm tot}}{2\sigma^2}, \qquad (22.9)$$

where $S_{\rm tot} = \sum_n (x_n - \mu)^2$. To find the maximum of the likelihood, we can
differentiate with respect to $\ln \sigma$. [It's often most hygienic to differentiate with
respect to $\ln u$ rather than $u$, when $u$ is a scale variable; we use
$\mathrm{d}u^n/\mathrm{d}(\ln u) = n u^n$.]

$$\frac{\partial \ln P(\{x_n\}_{n=1}^N \mid \mu, \sigma)}{\partial \ln \sigma} = -N + \frac{S_{\rm tot}}{\sigma^2}. \qquad (22.10)$$

This derivative is zero when

$$\sigma^2 = \frac{S_{\rm tot}}{N}, \qquad (22.11)$$

i.e.,

$$\sigma = \sqrt{\frac{\sum_{n=1}^N (x_n - \mu)^2}{N}}. \qquad (22.12)$$

The second derivative is

$$\frac{\partial^2 \ln P(\{x_n\}_{n=1}^N \mid \mu, \sigma)}{\partial (\ln \sigma)^2} = -2 \frac{S_{\rm tot}}{\sigma^2}, \qquad (22.13)$$

and at the maximum-likelihood value of $\sigma^2$, this equals $-2N$. So error bars
on $\ln \sigma$ are

$$\sigma_{\ln \sigma} = \frac{1}{\sqrt{2N}}. \qquad (22.14)$$
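
The results of example 22.3 can be checked by finite differences. In this sketch the data values, the known mean and the step size `h` are all illustrative assumptions; the code evaluates the log likelihood as a function of $\ln \sigma$ and estimates its first and second derivatives at the maximum-likelihood value.

```python
import math

def log_lik_lnsigma(ln_sigma, data, mu):
    """Log likelihood (22.9) as a function of ln(sigma), for a known mean mu."""
    sigma = math.exp(ln_sigma)
    s_tot = sum((x - mu) ** 2 for x in data)
    return -len(data) * math.log(math.sqrt(2 * math.pi) * sigma) - s_tot / (2 * sigma ** 2)

data = [0.3, 1.1, 0.8, 1.6, 1.2]   # illustrative values only
mu = 1.0                           # assumed known mean
n = len(data)
sigma_ml = math.sqrt(sum((x - mu) ** 2 for x in data) / n)   # equation (22.12)

h = 1e-4                           # finite-difference step in ln(sigma)
l0 = math.log(sigma_ml)
f_minus = log_lik_lnsigma(l0 - h, data, mu)
f_zero = log_lik_lnsigma(l0, data, mu)
f_plus = log_lik_lnsigma(l0 + h, data, mu)
first = (f_plus - f_minus) / (2 * h)             # close to zero at the maximum
second = (f_plus - 2 * f_zero + f_minus) / h ** 2  # close to -2N

print(sigma_ml, first, second, -2 * n)
print(1 / math.sqrt(2 * n))        # error bar on ln(sigma)
```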


> Exercise 22.4.[1] Show that the values of $\mu$ and $\ln \sigma$ that jointly maximize the
likelihood are: $\{\mu, \sigma\}_{\rm ML} = \{\bar{x}, \sigma_N = \sqrt{S/N}\}$, where

$$\sigma_N \equiv \sqrt{\frac{\sum_{n=1}^N (x_n - \bar{x})^2}{N}}. \qquad (22.15)$$

> 22.2 Maximum likelihood for a mixture of Gaussians 


We now derive an algorithm for fitting a mixture of Gaussians to
one-dimensional data. In fact, this algorithm is so important to understand that
you, gentle reader, get to derive the algorithm. Please work through the
following exercise.


> Exercise 22.5.[2, p.310] A random variable $x$ is assumed to have a probability
distribution that is a mixture of two Gaussians,

$$P(x \mid \mu_1, \mu_2, \sigma) = \left[ \sum_{k=1}^{2} p_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu_k)^2}{2\sigma^2} \right) \right], \qquad (22.16)$$

where the two Gaussians are given the labels $k = 1$ and $k = 2$; the prior
probability of the class label $k$ is $\{p_1 = 1/2, p_2 = 1/2\}$; $\{\mu_k\}$ are the means
of the two Gaussians; and both have standard deviation $\sigma$. For brevity, we
denote these parameters by $\theta \equiv \{\{\mu_k\}, \sigma\}$.

A data set consists of $N$ points $\{x_n\}_{n=1}^N$ which are assumed to be
independent samples from this distribution. Let $k_n$ denote the unknown class label of
the $n$th point.

Assuming that $\{\mu_k\}$ and $\sigma$ are known, show that the posterior probability
of the class label $k_n$ of the $n$th point can be written as

$$P(k_n = 1 \mid x_n, \theta) = \frac{1}{1 + \exp[-(w_1 x_n + w_0)]}$$

$$P(k_n = 2 \mid x_n, \theta) = \frac{1}{1 + \exp[+(w_1 x_n + w_0)]}, \qquad (22.17)$$

and give expressions for $w_1$ and $w_0$.

Assume now that the means $\{\mu_k\}$ are not known, and that we wish to
infer them from the data $\{x_n\}_{n=1}^N$. (The standard deviation $\sigma$ is known.) In
the remainder of this question we will derive an iterative algorithm for finding
values for $\{\mu_k\}$ that maximize the likelihood,

$$P(\{x_n\}_{n=1}^N \mid \{\mu_k\}, \sigma) = \prod_n P(x_n \mid \{\mu_k\}, \sigma). \qquad (22.18)$$

Let $L$ denote the natural log of the likelihood. Show that the derivative of the
log likelihood with respect to $\mu_k$ is given by

$$\frac{\partial}{\partial \mu_k} L = \sum_n p_{k|n} \frac{(x_n - \mu_k)}{\sigma^2}, \qquad (22.19)$$

where $p_{k|n} \equiv P(k_n = k \mid x_n, \theta)$ appeared above at equation (22.17).

Show, neglecting terms in $\frac{\partial}{\partial \mu_k} P(k_n = k \mid x_n, \theta)$, that the second derivative
is approximately given by

$$\frac{\partial^2}{\partial \mu_k^2} L = -\sum_n p_{k|n} \frac{1}{\sigma^2}. \qquad (22.20)$$

Hence show that from an initial state $\mu_1, \mu_2$, an approximate Newton-Raphson
step updates these parameters to $\mu_1', \mu_2'$, where

$$\mu_k' = \frac{\sum_n p_{k|n} x_n}{\sum_n p_{k|n}}. \qquad (22.21)$$

[The Newton-Raphson method for maximizing $L(\mu)$ updates $\mu$ to
$\mu' = \mu - \left[ \frac{\partial L}{\partial \mu} \Big/ \frac{\partial^2 L}{\partial \mu^2} \right]$.]














Assuming that $\sigma = 1$, sketch a contour plot of the likelihood function as a
function of $\mu_1$ and $\mu_2$ for the data set shown above. The data set consists of
32 points. Describe the peaks in your sketch and indicate their widths.


Notice that the algorithm you have derived for maximizing the likelihood 
is identical to the soft K-means algorithm of section 20.4. Now that it is clear 
that clustering can be viewed as mixture-density-modelling, we are able to 
derive enhancements to the K-means algorithm, which rectify the problems 
we noted earlier. 
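
The iteration implied by the update (22.21) can be sketched in Python as follows. This is only an illustration under stated assumptions: the ten data points and the initial means are made up, the responsibilities are computed directly from Bayes' theorem (so the expressions for $w_1$ and $w_0$ are still left to the exercise), and the function names are my own.

```python
import math

def responsibilities(x, mus, sigma):
    """Posterior class probabilities p_{k|n} for one point, with equal class priors."""
    logs = [-(x - m) ** 2 / (2 * sigma ** 2) for m in mus]
    top = max(logs)                       # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def update_means(data, mus, sigma):
    """One step of (22.21): each mean becomes a responsibility-weighted average."""
    resp = [responsibilities(x, mus, sigma) for x in data]
    new_mus = []
    for k in range(len(mus)):
        r_total = sum(r[k] for r in resp)
        new_mus.append(sum(r[k] * x for r, x in zip(resp, data)) / r_total)
    return new_mus

def log_likelihood(data, mus, sigma):
    """Natural log of (22.18) for an equal-weight mixture with common sigma."""
    total = 0.0
    for x in data:
        p = sum(math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mus)
        total += math.log(p / (len(mus) * math.sqrt(2 * math.pi) * sigma))
    return total

data = [0.6, 0.9, 1.0, 1.3, 1.4, 4.5, 4.8, 5.1, 5.2, 5.6]   # made-up data
mus = [2.0, 3.0]    # arbitrary initial state
sigma = 1.0
history = [log_likelihood(data, mus, sigma)]
for _ in range(20):
    mus = update_means(data, mus, sigma)
    history.append(log_likelihood(data, mus, sigma))

print(mus)                        # one mean near each clump of points
print(history[0], history[-1])    # the log likelihood has increased
```

On this toy data set the log likelihood never decreases from one step to the next, which is what one hopes for from a sensible maximization algorithm.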


»> 22.3 Enhancements to soft K-means 


Algorithm 22.2 shows a version of the soft-K-means algorithm corresponding
to a modelling assumption that each cluster is a spherical Gaussian having its
own width (each cluster has its own $\beta^{(k)} = 1/\sigma_k^2$). The algorithm updates the
lengthscales $\sigma_k$ for itself. The algorithm also includes cluster weight
parameters $\pi_1, \pi_2, \ldots, \pi_K$ which also update themselves, allowing accurate modelling
of data from clusters of unequal weights. This algorithm is demonstrated in
figure 22.3 for two data sets that we've seen before. The second example shows
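Algorithm 22.2. The soft K-means algorithm, version 2.

Assignment step. The responsibilities are

$$r_k^{(n)} = \frac{\pi_k \frac{1}{(\sqrt{2\pi}\,\sigma_k)^I} \exp\left( -\frac{1}{\sigma_k^2} d(\mathbf{m}^{(k)}, \mathbf{x}^{(n)}) \right)}{\sum_{k'} \pi_{k'} \frac{1}{(\sqrt{2\pi}\,\sigma_{k'})^I} \exp\left( -\frac{1}{\sigma_{k'}^2} d(\mathbf{m}^{(k')}, \mathbf{x}^{(n)}) \right)} \qquad (22.22)$$

where $I$ is the dimensionality of $\mathbf{x}$.

Update step. Each cluster's parameters, $\mathbf{m}^{(k)}$, $\pi_k$, and $\sigma_k^2$, are adjusted
to match the data points that it is responsible for.

$$\mathbf{m}^{(k)} = \frac{\sum_n r_k^{(n)} \mathbf{x}^{(n)}}{R^{(k)}} \qquad (22.23)$$

$$\sigma_k^2 = \frac{\sum_n r_k^{(n)} (\mathbf{x}^{(n)} - \mathbf{m}^{(k)})^2}{I R^{(k)}} \qquad (22.24)$$

$$\pi_k = \frac{R^{(k)}}{\sum_k R^{(k)}} \qquad (22.25)$$

where $R^{(k)}$ is the total responsibility of mean $k$,

$$R^{(k)} = \sum_n r_k^{(n)}. \qquad (22.26)$$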
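
Here is a rough Python sketch of the version-2 algorithm, written under some assumptions of mine: the distance $d$ is taken to be half the squared Euclidean distance (so the exponent is the usual Gaussian one), the data are two synthetic clusters of unequal size and width, and the function name, seed and initial conditions are invented for the demonstration. No safeguard is included against a variance collapsing towards zero.

```python
import math
import random

def soft_kmeans_v2(points, means, sigma2, weights, n_iter=50):
    """Soft K-means with one variance sigma2[k] and one weight weights[k] per cluster."""
    dim = len(points[0])            # I, the dimensionality of x
    n_clusters = len(means)
    for _ in range(n_iter):
        # Assignment step: responsibilities, normalized in the log domain.
        resp = []
        for x in points:
            logs = []
            for k in range(n_clusters):
                sq = sum((m - xi) ** 2 for m, xi in zip(means[k], x))
                logs.append(math.log(weights[k])
                            - 0.5 * dim * math.log(2 * math.pi * sigma2[k])
                            - sq / (2 * sigma2[k]))
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # Update step: re-estimate the mean, variance and weight of each cluster.
        for k in range(n_clusters):
            r_total = sum(r[k] for r in resp)
            means[k] = [sum(r[k] * x[i] for r, x in zip(resp, points)) / r_total
                        for i in range(dim)]
            sq_sum = sum(r[k] * sum((xi - m) ** 2 for xi, m in zip(x, means[k]))
                         for r, x in zip(resp, points))
            sigma2[k] = sq_sum / (dim * r_total)
            weights[k] = r_total / len(points)
    return means, sigma2, weights

rng = random.Random(1)   # illustrative seed
small = [[rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)] for _ in range(30)]
large = [[rng.gauss(4.0, 1.0), rng.gauss(4.0, 1.0)] for _ in range(70)]
means, sigma2, weights = soft_kmeans_v2(
    small + large,
    means=[[1.0, 1.0], [3.0, 3.0]], sigma2=[1.0, 1.0], weights=[0.5, 0.5])
print(means)     # one mean near (0, 0), the other near (4, 4)
print(sigma2)    # the first variance is much smaller than the second
print(weights)   # roughly 0.3 and 0.7
```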





Figure 22.3. Soft K-means algorithm, with K = 2, applied (a) to the 40-point
data set of figure 20.3; (b) to the little 'n' large data set of figure 20.5.
[Panel labels recovered: t = 0, 1, 2, 3, 9.]

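Algorithm 22.4. The soft K-means algorithm, version 3, which corresponds to a
model of axis-aligned Gaussians.

$$r_k^{(n)} = \frac{\pi_k \frac{1}{\prod_{i=1}^{I} \sqrt{2\pi}\,\sigma_i^{(k)}} \exp\left( -\sum_{i=1}^{I} (m_i^{(k)} - x_i^{(n)})^2 / 2(\sigma_i^{(k)})^2 \right)}{\sum_{k'} (\text{numerator, with } k' \text{ in place of } k)} \qquad (22.27)$$

$$\sigma_i^{2\,(k)} = \frac{\sum_n r_k^{(n)} (x_i^{(n)} - m_i^{(k)})^2}{R^{(k)}} \qquad (22.28)$$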





Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

Figure 22.5. Soft K-means algorithm, version 3, applied to the data consisting
of two cigar-shaped clusters. K = 2 (cf. figure 20.6).
[Panel labels recovered: t = 10, 20, 30.]

Figure 22.6. Soft K-means algorithm, version 3, applied to the little 'n' large
data set. K = 2. [Panel labels recovered: t = 0, 10, 20, 26, 32.]

that convergence can take a long time, but eventually the algorithm identifies
the small cluster and the large cluster.
Soft K-means, version 2, is a maximum-likelihood algorithm for fitting a
mixture of spherical Gaussians to data ('spherical' meaning that the variance
of the Gaussian is the same in all directions). [A proof that the algorithm does
indeed maximize the likelihood is deferred to section 33.7.] This algorithm is
still no good at modelling the cigar-shaped clusters of figure 20.6. If we wish
to model the clusters by axis-aligned Gaussians with possibly-unequal variances,
we replace the assignment rule (22.22) and the variance update rule (22.24) by
the rules (22.27) and (22.28) displayed in algorithm 22.4.
This third version of soft K-means is demonstrated in figure 22.5 on the 
‘two cigars’ data set of figure 20.6. After 30 iterations, the algorithm correctly 
locates the two clusters. Figure 22.6 shows the same algorithm applied to the 
little ’n’ large data set; again, the correct cluster locations are found. 


> 22.4 A fatal flaw of maximum likelihood 


Finally, figure 22.7 sounds a cautionary note: when we fit K = 4 means to our 
first toy data set, we sometimes find that very small clusters form, covering 
just one or two data points. This is a pathological property of soft K-means 
clustering, versions 2 and 3. 


> Exercise 22.6.[2] Investigate what happens if one mean $\mathbf{m}^{(k)}$ sits exactly on
top of one data point; show that if the variance $\sigma_k^2$ is sufficiently small,
then no return is possible: $\sigma_k^2$ becomes ever smaller.





Figure 22.7. Soft K-means algorithm applied to a data set of 40 points. K = 4.
Notice that at convergence, one very small cluster has formed between two
data points. [Panel labels recovered: t = 0, 5, 10, 20.]

KABOOM!


Soft K-means can blow up. Put one cluster exactly on one data point and let its 
variance go to zero — you can obtain an arbitrarily large likelihood! Maximum 
likelihood methods can break down by finding highly tuned models that fit part 
of the data perfectly. This phenomenon is known as overfitting. The reason 
we are not interested in these solutions with enormous likelihood is this: sure, 
these parameter-settings may have enormous posterior probability density, 
but the density is large over only a very small volume of parameter space. So 
the probability mass associated with these likelihood spikes is usually tiny. 
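
The blow-up is easy to reproduce. The sketch below assumes a made-up one-dimensional data set and a two-component mixture with equal weights: one broad component covers all the data, while the second sits exactly on the data point $x = 1.0$ and has its width shrunk step by step. All names and numbers are illustrative.

```python
import math

def log_lik_mixture(data, means, sigmas, weights):
    """Log likelihood of one-dimensional data under a mixture of Gaussians."""
    total = 0.0
    for x in data:
        p = sum(w * math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)
                for m, s, w in zip(means, sigmas, weights))
        total += math.log(p)
    return total

data = [0.6, 0.9, 1.0, 1.3, 1.4, 4.5, 4.8, 5.1, 5.2, 5.6]   # made-up data
broad_mean, broad_sigma = 3.0, 2.5    # a broad component covering all the data
widths = [1e-2, 1e-4, 1e-8]           # widths of the component sitting on x = 1.0
log_liks = []
for s in widths:
    ll = log_lik_mixture(data, [broad_mean, 1.0], [broad_sigma, s], [0.5, 0.5])
    log_liks.append(ll)
    print(s, ll)   # the log likelihood keeps growing as the width shrinks
```

Each factor of 100 removed from the width adds roughly $\ln 100$ to the log likelihood, without limit.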

We conclude that maximum likelihood methods are not a satisfactory gen- 
eral solution to data-modelling problems: the likelihood may be infinitely large 
at certain parameter settings. Even if the likelihood does not have infinitely- 
large spikes, the maximum of the likelihood is often unrepresentative, in high- 
dimensional problems. 

Even in low-dimensional problems, maximum likelihood solutions can be
unrepresentative. As you may know from basic statistics, the maximum
likelihood estimator (22.15) for a Gaussian's standard deviation, $\sigma_N$, is a biased
estimator, a topic that we'll take up in Chapter 24.


The maximum a posteriori (MAP) method 


A popular replacement for maximizing the likelihood is maximizing the 
Bayesian posterior probability density of the parameters instead. However, 
multiplying the likelihood by a prior and maximizing the posterior does 
not make the above problems go away; the posterior density often also has 
infinitely-large spikes, and the maximum of the posterior probability density 
is often unrepresentative of the whole posterior distribution. Think back to 
the concept of typicality, which we encountered in Chapter 4: in high dimen- 
sions, most of the probability mass is in a typical set whose properties are 
quite different from the points that have the maximum probability density. 
Maxima are atypical. 

A further reason for disliking the maximum a posteriori is that it is
basis-dependent. If we make a nonlinear change of basis from the parameter $\theta$ to
the parameter $u = f(\theta)$ then the probability density of $\theta$ is transformed to

$$P(u) = P(\theta) \left| \frac{\partial \theta}{\partial u} \right|. \qquad (22.29)$$

The maximum of the density $P(u)$ will usually not coincide with the maximum
of the density $P(\theta)$. (For figures illustrating such nonlinear changes of basis,
see the next chapter.) It seems undesirable to use a method whose answers
change when we change representation.
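
A toy calculation makes the basis-dependence concrete. In the sketch below the density $P(\theta) \propto \theta e^{-\theta}$ on $\theta > 0$ and the change of basis $u = \ln \theta$ are my own illustrative choices, not taken from the text; the maxima are found by brute-force search over a fine grid.

```python
import math

def density_theta(theta):
    """An assumed density over theta > 0: P(theta) = theta * exp(-theta)."""
    return theta * math.exp(-theta)

def density_u(u):
    """Density of u = ln(theta), from (22.29): P(u) = P(theta) |d theta / d u|."""
    theta = math.exp(u)
    return density_theta(theta) * theta   # d theta / d u = theta

thetas = [0.001 * i for i in range(1, 20001)]      # theta in (0, 20]
us = [-5.0 + 0.001 * i for i in range(0, 8001)]    # u in [-5, 3]
theta_map = max(thetas, key=density_theta)
u_map = max(us, key=density_u)

print(theta_map)           # the maximum in the theta basis is at theta = 1
print(math.exp(u_map))     # the maximum in the u basis corresponds to theta = 2
```

The same distribution gives two different 'most probable' parameter values, depending only on the basis in which the density is written.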


Further reading 
The soft K-means algorithm is at the heart of the automatic classification 
package, AutoClass (Hanson et al., 1991b; Hanson et al., 1991a). 
> 22.5 Further exercises 
Exercises where maximum likelihood may be useful 


Exercise 22.7.[3] Make a version of the K-means algorithm that models the
data as a mixture of K arbitrary Gaussians, i.e., Gaussians that are not
constrained to be axis-aligned.




> Exercise 22.8.[2] (a) A photon counter is pointed at a remote star for one
minute, in order to infer the brightness, i.e., the rate of photons
arriving at the counter per minute, $\lambda$. Assuming the number of
photons collected $r$ has a Poisson distribution with mean $\lambda$,

$$P(r \mid \lambda) = \exp(-\lambda) \frac{\lambda^r}{r!}, \qquad (22.30)$$

what is the maximum likelihood estimate for $\lambda$, given $r = 9$? Find
error bars on $\ln \lambda$.

(b) Same situation, but now we assume that the counter detects not
only photons from the star but also 'background' photons. The
background rate of photons is known to be $b = 13$ photons per
minute. We assume the number of photons collected, $r$, has a Poisson
distribution with mean $\lambda + b$. Now, given $r = 9$ detected photons,
what is the maximum likelihood estimate for $\lambda$? Comment on this
answer, discussing also the Bayesian posterior distribution, and the
'unbiased estimator' of sampling theory, $\hat{\lambda} = r - b$.


Exercise 22.9.[2] A bent coin is tossed $N$ times, giving $N_a$ heads and $N_b$ tails.
Assume a beta distribution prior for the probability of heads, $p$, for
example the uniform distribution. Find the maximum likelihood and
maximum a posteriori values of $p$, then find the maximum likelihood
and maximum a posteriori values of the logit $a = \ln[p/(1-p)]$. Compare
with the predictive distribution, i.e., the probability that the next toss
will come up heads.


> Exercise 22.10.[2] Two men looked through prison bars; one saw stars, the
other tried to infer where the window frame was.

From the other side of a room, you look through a window and see stars
at locations $\{(x_n, y_n)\}$. You can't see the window edges because it is
totally dark apart from the stars. Assuming the window is rectangular and
that the visible stars' locations are independently randomly distributed,
what are the inferred values of $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$, according to
maximum likelihood? Sketch the likelihood as a function of $x_{\max}$, for
fixed $x_{\min}$, $y_{\min}$, and $y_{\max}$.

[Margin figure labels: $(x_{\min}, y_{\min})$, $(x_{\max}, y_{\max})$.]


> Exercise 22.11.[3] A sailor infers his location $(x, y)$ by measuring the bearings
of three buoys whose locations $(x_n, y_n)$ are given on his chart. Let the
true bearings of the buoys be $\theta_n$. Assuming that his measurement $\tilde{\theta}_n$ of
each bearing is subject to Gaussian noise of small standard deviation $\sigma$,
what is his inferred location, by maximum likelihood?

The sailor's rule of thumb says that the boat's position can be taken to
be the centre of the cocked hat, the triangle produced by the intersection
of the three measured bearings (figure 22.8). Can you persuade him that
the maximum likelihood answer is better?

[Figure 22.8 point labels: $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$.]

Figure 22.8. The standard way of drawing three slightly inconsistent bearings
on a chart produces a triangle called a cocked hat. Where is the sailor?

> Exercise 22.12.[3, p.310] Maximum likelihood fitting of an exponential-family
model.

Assume that a variable $x$ comes from a probability distribution of the
form

$$P(x \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\left( \sum_k w_k f_k(x) \right), \qquad (22.31)$$

where the functions $f_k(x)$ are given, and the parameters $\mathbf{w} = \{w_k\}$ are
not known. A data set $\{x^{(n)}\}$ of $N$ points is supplied.

Show by differentiating the log likelihood that the maximum-likelihood
parameters $\mathbf{w}_{\rm ML}$ satisfy

$$\sum_x P(x \mid \mathbf{w}_{\rm ML}) f_k(x) = \frac{1}{N} \sum_n f_k(x^{(n)}), \qquad (22.32)$$

where the left-hand sum is over all $x$, and the right-hand sum is over the
data points. A shorthand for this result is that each function-average
under the fitted model must equal the function-average found in the
data:

$$\langle f_k \rangle_{P(x \mid \mathbf{w}_{\rm ML})} = \langle f_k \rangle_{\rm Data}. \qquad (22.33)$$


> Exercise 22.13.[2] ‘Maximum entropy’ fitting of models to constraints.

When confronted by a probability distribution $P(x)$ about which only a
few facts are known, the maximum entropy principle (maxent) offers a
rule for choosing a distribution that satisfies those constraints.
According to maxent, you should select the $P(x)$ that maximizes the entropy

$$H = \sum_x P(x) \log 1/P(x), \qquad (22.34)$$

subject to the constraints. Assuming the constraints assert that the
averages of certain functions $f_k(x)$ are known, i.e.,

$$\langle f_k \rangle_{P(x)} = F_k, \qquad (22.35)$$

show, by introducing Lagrange multipliers (one for each constraint,
including normalization), that the maximum-entropy distribution has the
form

$$P(x)_{\rm Maxent} = \frac{1}{Z} \exp\left( \sum_k w_k f_k(x) \right), \qquad (22.36)$$

where the parameters $Z$ and $\{w_k\}$ are set such that the constraints
(22.35) are satisfied.


And hence the maximum entropy method gives identical results to max- 
imum likelihood fitting of an exponential-family model (previous exer- 
cise). 


The maximum entropy method has sometimes been recommended as a 
method for assigning prior distributions in Bayesian modelling. While 
the outcomes of the maximum entropy method are sometimes interesting 
and thought-provoking, I do not advocate maxent as the approach to 
assigning priors. 

Maximum entropy is also sometimes proposed as a method for solving
inference problems, for example, ‘given that the mean score of
this unfair six-sided die is 2.5, what is its probability distribution
$(p_1, p_2, p_3, p_4, p_5, p_6)$?’ I think it is a bad idea to use maximum entropy
in this way; it can give silly answers. The correct way to solve inference
problems is to use Bayes’ theorem.
Exercises where maximum likelihood and MAP have difficulties


> Exercise 22.14.[2] This exercise explores the idea that maximizing a
probability density is a poor way to find a point that is representative of
the density. Consider a Gaussian distribution in a $k$-dimensional space,
$P(\mathbf{w}) = (1/\sqrt{2\pi}\,\sigma_W)^k \exp(-\sum_1^k w_i^2 / 2\sigma_W^2)$. Show that nearly all of the
probability mass of a Gaussian is in a thin shell of radius $r = \sqrt{k}\,\sigma_W$
and of thickness proportional to $r/\sqrt{k}$. For example, in 1000 dimensions,
90% of the mass of a Gaussian with $\sigma_W = 1$ is in a shell of radius
31.6 and thickness 2.8. However, the probability density at the origin is
$e^{k/2} \simeq 10^{217}$ times bigger than the density at this shell where most of
the probability mass is.

Now consider two Gaussian densities in 1000 dimensions that differ in
radius $\sigma_W$ by just 1%, and that contain equal total probability mass.
Show that the maximum probability density is greater at the centre of
the Gaussian with smaller $\sigma_W$ by a factor of $\sim \exp(0.01k) \simeq 20\,000$.

In ill-posed problems, a typical posterior distribution is often a weighted
superposition of Gaussians with varying means and standard deviations,
so the true posterior has a skew peak, with the maximum of the
probability density located near the mean of the Gaussian distribution that
has the smallest standard deviation, not the Gaussian with the greatest
weight.

Figure 22.9. Seven measurements $\{x_n\}$ of a parameter $\mu$ by seven
scientists each having his own noise-level $\sigma_n$.

    Scientist      x_n
    A          -27.020
    B            3.570
    C            8.191
    D            9.898
    E            9.603
    F            9.945
    G           10.056

> Exercise 22.15.[3] The seven scientists. $N$ datapoints $\{x_n\}$ are drawn from
$N$ distributions, all of which are Gaussian with a common mean $\mu$ but
with different unknown standard deviations $\sigma_n$. What are the maximum
likelihood parameters $\mu, \{\sigma_n\}$ given the data? For example, seven
scientists (A, B, C, D, E, F, G) with wildly-differing experimental skills
measure $\mu$. You expect some of them to do accurate work (i.e., to have
small $\sigma_n$), and some of them to turn in wildly inaccurate answers (i.e.,
to have enormous $\sigma_n$). Figure 22.9 shows their seven results. What is
$\mu$, and how reliable is each scientist?

I hope you agree that, intuitively, it looks pretty certain that A and B
are both inept measurers, that D-G are better, and that the true value
of $\mu$ is somewhere close to 10. But what does maximizing the likelihood
tell you?


Exercise 22.16.[3] Problems with MAP method. A collection of widgets
$i = 1, \ldots, k$ have a property called ‘wodge’, $w_i$, which we measure,
widget by widget, in noisy experiments with a known noise level $\sigma_\nu = 1.0$.
Our model for these quantities is that they come from a Gaussian prior
$P(w_i \mid \alpha) = \mathrm{Normal}(0, 1/\alpha)$, where $\alpha = 1/\sigma_W^2$ is not known. Our prior for
this variance is flat over $\log \sigma_W$ from $\sigma_W = 0.1$ to $\sigma_W = 10$.

Scenario 1. Suppose four widgets have been measured and give the
following data: $\{d_1, d_2, d_3, d_4\} = \{2.2, -2.2, 2.8, -2.8\}$. We are interested
in inferring the wodges of these four widgets.

(a) Find the values of $\mathbf{w}$ and $\alpha$ that maximize the posterior probability
$P(\mathbf{w}, \log \alpha \mid \mathbf{d})$.

(b) Marginalize over $\alpha$ and find the posterior probability density of $\mathbf{w}$
given the data. [Integration skills required. See MacKay (1999a) for
solution.] Find maxima of $P(\mathbf{w} \mid \mathbf{d})$. [Answer: two maxima: one at
$\mathbf{w}_{\rm MP} = \{1.8, -1.8, 2.2, -2.2\}$, with error bars on all four parameters
(obtained from Gaussian approximation to the posterior) $\pm 0.9$; and
one at $\mathbf{w}'_{\rm MP} = \{0.03, -0.03, 0.04, -0.04\}$ with error bars $\pm 0.1$.]





Scenario 2. Suppose in addition to the four measurements above we are
now informed that there are four more widgets that have been measured
with a much less accurate instrument, having $\sigma'_\nu = 100.0$. Thus we now
have both well-determined and ill-determined parameters, as in a typical
ill-posed problem. The data from these measurements were a string of
uninformative values, $\{d_5, d_6, d_7, d_8\} = \{100, -100, 100, -100\}$.

We are again asked to infer the wodges of the widgets. Intuitively, our
inferences about the well-measured widgets should be negligibly affected
by this vacuous information about the poorly-measured widgets. But
what happens to the MAP method?

(a) Find the values of $\mathbf{w}$ and $\alpha$ that maximize the posterior probability
$P(\mathbf{w}, \log \alpha \mid \mathbf{d})$.

(b) Find maxima of $P(\mathbf{w} \mid \mathbf{d})$. [Answer: only one maximum, $\mathbf{w}_{\rm MP} =
\{0.03, -0.03, 0.03, -0.03, 0.0001, -0.0001, 0.0001, -0.0001\}$, with
error bars on all eight parameters $\pm 0.11$.]

> 22.6 Solutions


Solution to exercise 22.5 (p.302). Figure 22.10 shows a contour plot of the
likelihood function for the 32 data points. The peaks are pretty-near centred
on the points (1, 5) and (5, 1), and are pretty-near circular in their contours.
The width of each of the peaks is a standard deviation of $\sigma/\sqrt{16} = 1/4$. The
peaks are roughly Gaussian in shape.

Figure 22.10. The likelihood as a function of $\mu_1$ and $\mu_2$.
[Axis ticks recovered: 0, 1, 2, 3, 4.]


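Solution to exercise 22.12 (p.307). The log likelihood is:

$$\ln P(\{x^{(n)}\} \mid \mathbf{w}) = -N \ln Z(\mathbf{w}) + \sum_n \sum_k w_k f_k(x^{(n)}). \qquad (22.37)$$

$$\frac{\partial}{\partial w_k} \ln P(\{x^{(n)}\} \mid \mathbf{w}) = -N \frac{\partial}{\partial w_k} \ln Z(\mathbf{w}) + \sum_n f_k(x^{(n)}). \qquad (22.38)$$

Now, the fun part is what happens when we differentiate the log of the
normalizing constant:

$$\frac{\partial}{\partial w_k} \ln Z(\mathbf{w}) = \frac{1}{Z(\mathbf{w})} \sum_x \frac{\partial}{\partial w_k} \exp\left( \sum_{k'} w_{k'} f_{k'}(x) \right)$$

$$= \frac{1}{Z(\mathbf{w})} \sum_x \exp\left( \sum_{k'} w_{k'} f_{k'}(x) \right) f_k(x) = \sum_x P(x \mid \mathbf{w}) f_k(x), \qquad (22.39)$$

so

$$\frac{\partial}{\partial w_k} \ln P(\{x^{(n)}\} \mid \mathbf{w}) = -N \sum_x P(x \mid \mathbf{w}) f_k(x) + \sum_n f_k(x^{(n)}), \qquad (22.40)$$

and at the maximum of the likelihood,

$$\sum_x P(x \mid \mathbf{w}_{\rm ML}) f_k(x) = \frac{1}{N} \sum_n f_k(x^{(n)}). \qquad (22.41)$$
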
Useful Probability Distributions



































In Bayesian data modelling, there's a small collection of probability
distributions that come up again and again. The purpose of this chapter is to
introduce these distributions so that they won't be intimidating when encountered
in combat situations.
There is no need to memorize any of them, except perhaps the Gaussian;
if a distribution is important enough, it will memorize itself, and otherwise, it
can easily be looked up.

> 23.1 Distributions over integers

Binomial, Poisson, exponential
We already encountered the binomial distribution and the Poisson distribution
on page 2.


The binomial distribution for an integer $r$ with parameters $f$ (the bias,
$f \in [0, 1]$) and $N$ (the number of trials) is:

$$P(r \mid f, N) = \binom{N}{r} f^r (1 - f)^{N - r}, \quad r \in \{0, 1, 2, \ldots, N\}. \qquad (23.1)$$

The binomial distribution arises, for example, when we flip a bent coin,
with bias $f$, $N$ times, and observe the number of heads, $r$.

Figure 23.1. The binomial distribution $P(r \mid f = 0.3, N = 10)$, on a linear
scale (top) and a logarithmic scale (bottom).

The Poisson distribution with parameter $\lambda > 0$ is:

$$P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!}, \quad r \in \{0, 1, 2, \ldots\}. \qquad (23.2)$$

The Poisson distribution arises, for example, when we count the number of
photons $r$ that arrive in a pixel during a fixed interval, given that the mean
intensity on the pixel corresponds to an average number of photons $\lambda$.

Figure 23.2. The Poisson distribution $P(r \mid \lambda = 2.7)$, on a linear scale
(top) and a logarithmic scale (bottom).

The exponential distribution on integers,

$$P(r \mid f) = f^r (1 - f), \quad r \in (0, 1, 2, \ldots, \infty), \qquad (23.3)$$

arises in waiting problems. How long will you have to wait until a six is rolled,
if a fair six-sided dice is rolled? Answer: the probability distribution of the
number of rolls, $r$, is exponential over integers with parameter $f = 5/6$. The
distribution may also be written

$$P(r \mid f) = (1 - f)\, e^{-\lambda r}, \quad r \in (0, 1, 2, \ldots, \infty), \qquad (23.4)$$

where $\lambda = \ln(1/f)$.
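
The three distributions over integers are simple to code up. The following Python sketch (function names and parameter values are illustrative assumptions) checks that each one sums to one, truncating the infinite sums where the remaining terms are negligible, and confirms that (23.3) and (23.4) are the same distribution.

```python
import math

def binomial_pmf(r, f, n):
    """Equation (23.1)."""
    return math.comb(n, r) * f ** r * (1 - f) ** (n - r)

def poisson_pmf(r, lam):
    """Equation (23.2)."""
    return math.exp(-lam) * lam ** r / math.factorial(r)

def exponential_int_pmf(r, f):
    """Equation (23.3)."""
    return f ** r * (1 - f)

total_binomial = sum(binomial_pmf(r, 0.3, 10) for r in range(11))
total_poisson = sum(poisson_pmf(r, 2.7) for r in range(100))
total_exponential = sum(exponential_int_pmf(r, 5 / 6) for r in range(1000))
print(total_binomial, total_poisson, total_exponential)   # each is 1 up to rounding

# Equation (23.4) is the same distribution written with lambda = ln(1/f).
f = 5 / 6
lam = math.log(1 / f)
form_a = exponential_int_pmf(4, f)
form_b = (1 - f) * math.exp(-lam * 4)
print(form_a, form_b)   # the two forms agree
```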
> 23.2 Distributions over unbounded real numbers


Gaussian, Student, Cauchy, biexponential, inverse-cosh.
The Gaussian distribution or normal distribution with mean $\mu$ and standard
deviation $\sigma$ is

$$P(x \mid \mu, \sigma) = \frac{1}{Z} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad x \in (-\infty, \infty), \qquad (23.5)$$

where

$$Z = \sqrt{2\pi\sigma^2}. \qquad (23.6)$$

It is sometimes useful to work with the quantity $\tau \equiv 1/\sigma^2$, which is called the
precision parameter of the Gaussian.
A sample $z$ from a standard univariate Gaussian can be generated by
computing

$$z = \cos(2\pi u_1) \sqrt{2 \ln(1/u_2)}, \qquad (23.7)$$

where $u_1$ and $u_2$ are uniformly distributed in $(0, 1)$. A second sample
$z_2 = \sin(2\pi u_1) \sqrt{2 \ln(1/u_2)}$, independent of the first, can then be obtained for free.
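
The recipe (23.7) translates directly into code. In this sketch the seed, the number of samples and the function name are illustrative assumptions; the second uniform variate is shifted from $[0, 1)$ to $(0, 1]$ so that the logarithm is always finite.

```python
import math
import random

def gaussian_pair(rng):
    """Two independent standard Gaussian samples from two uniforms, as in (23.7)."""
    u1 = rng.random()
    u2 = 1.0 - rng.random()   # lies in (0, 1], so ln(1/u2) is finite
    radius = math.sqrt(2 * math.log(1 / u2))
    return math.cos(2 * math.pi * u1) * radius, math.sin(2 * math.pi * u1) * radius

rng = random.Random(0)   # illustrative seed
samples = []
for _ in range(50000):
    samples.extend(gaussian_pair(rng))

mean = sum(samples) / len(samples)
var = sum((z - mean) ** 2 for z in samples) / len(samples)
print(mean, var)   # close to 0 and 1 for a standard Gaussian
```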

The Gaussian distribution is widely used and often asserted to be a very
common distribution in the real world, but I am sceptical about this
assertion. Yes, unimodal distributions may be common; but a Gaussian is a
special, rather extreme, unimodal distribution. It has very light tails: the
log-probability-density decreases quadratically. The typical deviation of $x$ from $\mu$
is $\sigma$, but the respective probabilities that $x$ deviates from $\mu$ by more than $2\sigma$,
$3\sigma$, $4\sigma$, and $5\sigma$, are 0.046, 0.003, $6 \times 10^{-5}$, and $6 \times 10^{-7}$. In my experience,
deviations from a mean four or five times greater than the typical deviation
may be rare, but not as rare as $6 \times 10^{-5}$! I therefore urge caution in the use of
Gaussian distributions: if a variable that is modelled with a Gaussian actually
has a heavier-tailed distribution, the rest of the model will contort itself to
reduce the deviations of the outliers, like a sheet of paper being crushed by a
rubber band.

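
The tail probabilities quoted above can be reproduced with the complementary error function from Python's standard library. The function name below is my own.

```python
import math

def two_sided_tail(k):
    """P(|x - mu| > k * sigma) for a Gaussian, via the complementary error function."""
    return math.erfc(k / math.sqrt(2))

tails = {k: two_sided_tail(k) for k in (2, 3, 4, 5)}
for k, p in tails.items():
    print(k, p)   # roughly 0.046, 0.003, 6e-5 and 6e-7
```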





> Exercise 23.1.[1] Pick a variable that is supposedly bell-shaped in probability
distribution, gather data, and make a plot of the variable's empirical
distribution. Show the distribution as a histogram on a log scale and
investigate whether the tails are well-modelled by a Gaussian
distribution. [One example of a variable to study is the amplitude of an audio
signal.]

One distribution with heavier tails than a Gaussian is a mixture of
Gaussians. A mixture of two Gaussians, for example, is defined by two means,
two standard deviations, and two mixing coefficients $\pi_1$ and $\pi_2$, satisfying
$\pi_1 + \pi_2 = 1$, $\pi_i \ge 0$.

$$P(x \mid \mu_1, \sigma_1, \pi_1, \mu_2, \sigma_2, \pi_2) = \frac{\pi_1}{\sqrt{2\pi}\,\sigma_1} \exp\left( -\frac{(x - \mu_1)^2}{2\sigma_1^2} \right) + \frac{\pi_2}{\sqrt{2\pi}\,\sigma_2} \exp\left( -\frac{(x - \mu_2)^2}{2\sigma_2^2} \right).$$

Figure 23.3. Three unimodal distributions. Two Student distributions, with
parameters $(m, s) = (1, 1)$ (heavy line) (a Cauchy distribution) and $(2, 4)$
(light line), and a Gaussian distribution with mean $\mu = 3$ and standard
deviation $\sigma = 3$ (dashed line), shown on linear vertical scales (top) and
logarithmic vertical scales (bottom). Notice that the heavy tails of the Cauchy
distribution are scarcely evident in the upper ‘bell-shaped curve’.

If we take an appropriately weighted mixture of an infinite number of
Gaussians, all having mean $\mu$, we obtain a Student-t distribution,

$$P(x \mid \mu, s, n) = \frac{1}{Z} \frac{1}{(1 + (x - \mu)^2/(n s^2))^{(n+1)/2}}, \qquad (23.8)$$

where

$$Z = \sqrt{\pi n s^2}\, \frac{\Gamma(n/2)}{\Gamma((n+1)/2)} \qquad (23.9)$$

and $n$ is called the number of degrees of freedom and $\Gamma$ is the gamma function.
If $n > 1$ then the Student distribution (23.8) has a mean and that mean is
$\mu$. If $n > 2$ the distribution also has a finite variance, $\sigma^2 = n s^2/(n - 2)$.
As $n \to \infty$, the Student distribution approaches the normal distribution with
mean $\mu$ and standard deviation $s$. The Student distribution arises both in
classical statistics (as the sampling-theoretic distribution of certain statistics)
and in Bayesian inference (as the probability distribution of a variable coming
from a Gaussian distribution whose standard deviation we aren't sure of).

In the special case $n = 1$, the Student distribution is called the Cauchy
distribution.
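
The two limiting cases can be checked numerically. This sketch evaluates the Student density (23.8) with its normalizing constant (23.9) computed through the log-gamma function (an implementation choice of mine, to avoid overflow for large $n$) and compares it with a Cauchy density at $n = 1$ and with a Gaussian of standard deviation $s$ at very large $n$. The evaluation point and parameter values are arbitrary.

```python
import math

def student_pdf(x, mu, s, n):
    """Student density (23.8); the normalizer (23.9) is evaluated with log-gamma."""
    log_z = (0.5 * math.log(math.pi * n * s ** 2)
             + math.lgamma(n / 2) - math.lgamma((n + 1) / 2))
    return math.exp(-log_z - 0.5 * (n + 1) * math.log1p((x - mu) ** 2 / (n * s ** 2)))

def cauchy_pdf(x, mu, s):
    """Cauchy density with location mu and scale s."""
    return 1.0 / (math.pi * s * (1.0 + ((x - mu) / s) ** 2))

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

x, mu, s = 0.7, 0.0, 2.0   # illustrative values
print(student_pdf(x, mu, s, 1), cauchy_pdf(x, mu, s))        # n = 1: the Cauchy distribution
print(student_pdf(x, mu, s, 1e6), gaussian_pdf(x, mu, s))    # large n: close to a Gaussian
```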


A distribution whose tails are intermediate in heaviness between Student
and Gaussian is the biexponential distribution,

$$P(x \mid \mu, s) = \frac{1}{Z} \exp\left( -\frac{|x - \mu|}{s} \right), \quad x \in (-\infty, \infty), \qquad (23.10)$$

where

$$Z = 2s. \qquad (23.11)$$

The inverse-cosh distribution

$$P(x \mid \beta) \propto \frac{1}{[\cosh(\beta x)]^{1/\beta}} \qquad (23.12)$$

is a popular model in independent component analysis. In the limit of large $\beta$,
the probability distribution $P(x \mid \beta)$ becomes a biexponential distribution. In
the limit $\beta \to 0$, $P(x \mid \beta)$ approaches a Gaussian with mean zero and variance
$1/\beta$.


»> 23.3 Distributions over positive real numbers 


Exponential, gamma, inverse-gamma, and log-normal.
The exponential distribution,

$$P(x \mid s) = \frac{1}{Z} \exp\left( -\frac{x}{s} \right), \quad x \in (0, \infty), \qquad (23.13)$$

where

$$Z = s, \qquad (23.14)$$

arises in waiting problems. How long will you have to wait for a bus in
Poissonville, given that buses arrive independently at random with one every $s$
minutes on average? Answer: the probability distribution of your wait, $x$, is
exponential with mean $s$.


The gamma distribution is like a Gaussian distribution, except whereas the
Gaussian goes from $-\infty$ to $\infty$, gamma distributions go from 0 to $\infty$. Just as
the Gaussian distribution has two parameters $\mu$ and $\sigma$ which control the mean
and width of the distribution, the gamma distribution has two parameters. It
is the product of the one-parameter exponential distribution (23.13) with a
polynomial, $x^{c-1}$. The exponent $c$ in the polynomial is the second parameter.

$$P(x \mid s, c) = \Gamma(x; s, c) = \frac{1}{Z} \left( \frac{x}{s} \right)^{c-1} \exp\left( -\frac{x}{s} \right), \quad 0 \le x < \infty \qquad (23.15)$$

where

$$Z = \Gamma(c)\, s. \qquad (23.16)$$
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A 08 Figure 23.4. Two gamma 
0.8 0.6 distributions, with parameters 
oe ne (s,c) = (1,3) (heavy lines) and 
03 oe 10, 0.3 (light lines), shown on 
Si 01 linear vertical scales (top) and 
2 4 6 8 10 4 2 0 2 4 logarithmic vertical scales 


(bottom); and shown as a 
function of x on the left (23.15) 


o1 4 / 0.1 and l = Inz on the right (23.18). 
0.01 0.01 


0.001 0.001 


0.0001 + T T T f 1 0.0001 T T T T T 








This is a simple peaked distribution with mean sc and variance s²c.


It is often natural to represent a positive real variable x in terms of its
logarithm l = ln x. The probability density of l is








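P(l) = P(x(l)) \left| \frac{\partial x}{\partial l} \right| = P(x(l)) \, x(l)    (23.17)
     = \frac{1}{Z_l} \left( \frac{x(l)}{s} \right)^{c} \exp\left( -\frac{x(l)}{s} \right),    (23.18)
where
Z_l = \Gamma(c).    (23.19)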


[The gamma distribution is named after its normalizing constant — an odd 
convention, it seems to me!] 

Figure 23.4 shows a couple of gamma distributions as a function of x and 
of l. Notice that where the original gamma distribution (23.15) may have a 
‘spike’ at x = 0, the distribution over l never has such a spike. The spike is 
an artefact of a bad choice of basis. 
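
The contrast between the two bases can be checked numerically. What follows is a minimal sketch and not part of the original text: the function names are illustrative, and the parameters (s, c) = (10, 0.3) are those quoted for the light curves of figure 23.4. It evaluates the density over x (23.15) and the density over l = ln x (23.18) as x approaches zero, and confirms by a simple trapezoid rule that the density over l is normalized.

```python
import math

def gamma_density_x(x, s, c):
    """Gamma density over x, equation (23.15), with Z = Gamma(c) * s."""
    return (x / s) ** (c - 1) * math.exp(-x / s) / (math.gamma(c) * s)

def gamma_density_l(l, s, c):
    """Density over l = ln x, equation (23.18), with Z_l = Gamma(c)."""
    x = math.exp(l)
    return (x / s) ** c * math.exp(-x / s) / math.gamma(c)

s, c = 10.0, 0.3

# Near x = 0 the density over x grows without bound when c < 1 ...
spike = [gamma_density_x(x, s, c) for x in (1e-2, 1e-4, 1e-6)]

# ... whereas the density over l = ln x falls towards zero in the same region.
no_spike = [gamma_density_l(math.log(x), s, c) for x in (1e-2, 1e-4, 1e-6)]

# Trapezoid-rule check that the density over l integrates to about one.
lo, hi, steps = -60.0, 8.0, 68000
h = (hi - lo) / steps
total = sum(gamma_density_l(lo + i * h, s, c) for i in range(1, steps)) * h
total += 0.5 * h * (gamma_density_l(lo, s, c) + gamma_density_l(hi, s, c))
```

The integration limits are a pragmatic choice for these particular parameters; other values of (s, c) may need a wider range.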

In the limit sc = 1, c → 0, we obtain the noninformative prior for a scale
parameter, the 1/x prior. This improper prior is called noninformative because
it has no associated length scale, no characteristic value of x, so it prefers all
values of x equally. It is invariant under the reparameterization x′ = mx. If
we transform the 1/x probability density into a density over l = ln x we find
the latter density is uniform.


> Exercise 23.2.[1] Imagine that we reparameterize a positive variable x in terms
of its cube root, u = x^{1/3}. If the probability density of x is the improper
distribution 1/x, what is the probability density of u?


The gamma distribution is always a unimodal density over l = ln x, and,
as can be seen in the figures, it is asymmetric. If x has a gamma distribution,
and we decide to work in terms of the inverse of x, v = 1/x, we obtain a new 
distribution, in which the density over l is flipped left-for-right: the probability 
density of v is called an inverse-gamma distribution, 


P(v | s, c) = \frac{1}{Z_v} \left( \frac{1}{sv} \right)^{c+1} \exp\left( -\frac{1}{sv} \right), \quad 0 \le v < \infty    (23.20)


where
Z_v = \Gamma(c)/s.    (23.21)

[Figure 23.5. Two inverse gamma distributions, with parameters (s, c) = (1, 3) (heavy lines) and (10, 0.3) (light lines), shown on linear vertical scales (top) and logarithmic vertical scales (bottom); and shown as a function of x on the left and l = ln x on the right. Horizontal axes: v (left) and ln v (right).]


Gamma and inverse gamma distributions crop up in many inference prob- 
lems in which a positive quantity is inferred from data. Examples include 
inferring the variance of Gaussian noise from some noise samples, and infer- 
ring the rate parameter of a Poisson distribution from the count. 

Gamma distributions also arise naturally in the distributions of waiting 
times between Poisson-distributed events. Given a Poisson process with rate
λ, the probability density of the arrival time x of the mth event is














\frac{\lambda (\lambda x)^{m-1}}{(m-1)!} e^{-\lambda x}.    (23.22)


Log-normal distribution

Another distribution over a positive real number x is the log-normal distribution,
which is the distribution that results when l = ln x has a normal distribution.
We define m to be the median value of x, and s to be the standard
deviation of ln x.

P(l | m, s) = \frac{1}{Z} \exp\left( -\frac{(l - \ln m)^2}{2 s^2} \right), \quad l \in (-\infty, \infty),    (23.23)

where

Z = \sqrt{2 \pi s^2},    (23.24)

implies

P(x | m, s) = \frac{1}{x} \frac{1}{Z} \exp\left( -\frac{(\ln x - \ln m)^2}{2 s^2} \right), \quad x \in (0, \infty).    (23.25)

[Figure 23.6. Two log-normal distributions, with parameters (m, s) = (3, 1.8) (heavy line) and (3, 0.7) (light line), shown on linear vertical scales (top) and logarithmic vertical scales (bottom). [Yes, they really do have the same value of the median, m = 3.]]


> 23.4 Distributions over periodic variables


A periodic variable θ is a real number ∈ [0, 2π] having the property that θ = 0
and θ = 2π are equivalent.

A distribution that plays for periodic variables the role played by the Gaussian
distribution for real variables is the Von Mises distribution:


P(\theta | \mu, \beta) = \frac{1}{Z} \exp\left( \beta \cos(\theta - \mu) \right), \quad \theta \in (0, 2\pi).    (23.26)


The normalizing constant is Z = 2πI₀(β), where I₀(x) is a modified Bessel
function.

A distribution that arises from Brownian diffusion around the circle is the
wrapped Gaussian distribution,


P(\theta | \mu, \sigma) = \sum_{n=-\infty}^{\infty} \mathrm{Normal}(\theta; (\mu + 2\pi n), \sigma^2), \quad \theta \in (0, 2\pi).    (23.27)
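
The stated normalizing constant of the Von Mises distribution, Z = 2πI₀(β), can be checked against a direct numerical integral. The sketch below is an illustration only: the helper `bessel_i0` sums the standard power series I₀(x) = Σ_k (x²/4)^k/(k!)², the parameter values μ = 1.0 and β = 2.5 are arbitrary, and the midpoint rule is simply one convenient way of integrating a periodic function.

```python
import math

def bessel_i0(x, terms=60):
    """Modified Bessel function I_0(x) from its power series
    sum_k (x^2/4)^k / (k!)^2."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= (x * x / 4.0) / ((k + 1) ** 2)
    return total

def von_mises_unnormalized(theta, mu, beta):
    """exp(beta * cos(theta - mu)), the numerator of equation (23.26)."""
    return math.exp(beta * math.cos(theta - mu))

mu, beta = 1.0, 2.5   # illustrative parameter values

# Midpoint-rule integral of the unnormalized density over one period.
steps = 2000
h = 2.0 * math.pi / steps
z_numeric = h * sum(von_mises_unnormalized((i + 0.5) * h, mu, beta)
                    for i in range(steps))

z_closed_form = 2.0 * math.pi * bessel_i0(beta)
```

The two values of Z should agree closely; the fixed number of series terms is adequate only for moderate β.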


> 23.5 Distributions over probabilities 





Beta distribution, Dirichlet distribution, entropic distribution

The beta distribution is a probability density over a variable p that is a
probability, p ∈ (0, 1):

P(p | u_1, u_2) = \frac{1}{Z(u_1, u_2)} p^{u_1 - 1} (1 - p)^{u_2 - 1}.    (23.28)

The parameters u1, u2 may take any positive value. The normalizing constant
is the beta function,

Z(u_1, u_2) = \frac{\Gamma(u_1) \Gamma(u_2)}{\Gamma(u_1 + u_2)}.    (23.29)

Special cases include the uniform distribution (u1 = 1, u2 = 1); the Jeffreys
prior (u1 = 0.5, u2 = 0.5); and the improper Laplace prior (u1 = 0, u2 = 0). If
we transform the beta distribution to the corresponding density over the logit
l ≡ ln p/(1 − p), we find it is always a pleasant bell-shaped density over l, while
the density over p may have singularities at p = 0 and p = 1 (figure 23.7).

[Figure 23.7. Three beta distributions, with (u1, u2) = (0.3, 1), (1.3, 1), and (12, 2). The upper figure shows P(p | u1, u2) as a function of p; the lower shows the corresponding density over the logit, ln p/(1 − p). Notice how well-behaved the densities are as a function of the logit.]

More dimensions

The Dirichlet distribution is a density over an I-dimensional vector p whose
I components are positive and sum to 1. The beta distribution is a special
case of a Dirichlet distribution with I = 2. The Dirichlet distribution is
parameterized by a measure u (a vector with all coefficients u_i > 0) which
I will write here as u = αm, where m is a normalized measure over the I
components (Σ_i m_i = 1), and α is positive:


P(p | \alpha m) = \frac{1}{Z(\alpha m)} \prod_{i=1}^{I} p_i^{\alpha m_i - 1} \, \delta\left( \sum_i p_i - 1 \right) \equiv \mathrm{Dirichlet}^{(I)}(p | \alpha m).    (23.30)


The function δ(x) is the Dirac delta function, which restricts the distribution
to the simplex such that p is normalized, i.e., Σ_i p_i = 1. The normalizing
constant of the Dirichlet distribution is:


Z(\alpha m) = \prod_i \Gamma(\alpha m_i) / \Gamma(\alpha).    (23.31)

The vector m is the mean of the probability distribution:

\int \mathrm{Dirichlet}^{(I)}(p | \alpha m) \, p \, d^I p = m.    (23.32)


When working with a probability vector p, it is often helpful to work in the
‘softmax basis’, in which, for example, a three-dimensional probability p =
(p1, p2, p3) is represented by three numbers a1, a2, a3 satisfying a1 + a2 + a3 = 0
and

p_i = \frac{1}{Z} e^{a_i}, \quad \text{where } Z = \sum_i e^{a_i}.    (23.33)
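
The change of basis in (23.33) can be sketched in a few lines. This is an illustration, not the author's code: the function names are invented here, and the inverse map uses the fact that subtracting the mean of the log probabilities enforces the constraint that the a_i sum to zero.

```python
import math

def to_softmax_basis(p):
    """Return a with sum(a) == 0 such that p_i = exp(a_i) / Z."""
    logs = [math.log(pi) for pi in p]
    mean_log = sum(logs) / len(logs)
    return [li - mean_log for li in logs]

def from_softmax_basis(a):
    """Equation (23.33): p_i = exp(a_i) / Z with Z = sum_i exp(a_i)."""
    shift = max(a)                      # subtracting a constant leaves p unchanged
    weights = [math.exp(ai - shift) for ai in a]
    z = sum(weights)
    return [w / z for w in weights]

p = [0.7, 0.2, 0.1]                     # an arbitrary probability vector
a = to_softmax_basis(p)
p_back = from_softmax_basis(a)
```

Shifting by the largest a_i before exponentiating is a common numerical precaution; it does not change the result.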


This nonlinear transformation is analogous to the σ → ln σ transformation
for a scale variable and the logit transformation for a single probability, p →
ln p/(1 − p). In the softmax basis, the ugly minus-ones in the exponents in the
Dirichlet distribution (23.30) disappear, and the density is given by:


P(a | \alpha m) \propto \frac{1}{Z(\alpha m)} \prod_{i=1}^{I} p_i^{\alpha m_i} \, \delta\left( \sum_i a_i \right).    (23.34)

[Figure 23.8. Three Dirichlet distributions over a three-dimensional probability vector (p1, p2, p3), with u = (20, 10, 7), u = (0.2, 1, 2), and u = (0.2, 0.3, 0.15). The upper figures show 1000 random draws from each distribution, showing the values of p1 and p2 on the two axes. p3 = 1 − (p1 + p2). The triangle in the first figure is the simplex of legal probability distributions. The lower figures show the same points in the ‘softmax’ basis (equation (23.33)). The two axes show a1 and a2. a3 = −a1 − a2.]



The role of the parameter α can be characterized in two ways. First, α measures
the sharpness of the distribution (figure 23.8); it measures how different
we expect typical samples p from the distribution to be from the mean m, just
as the precision τ = 1/σ² of a Gaussian measures how far samples stray from its
mean. A large value of α produces a distribution over p that is sharply peaked
around m. The effect of α in higher-dimensional situations can be visualized
by drawing a typical sample from the distribution Dirichlet^{(I)}(p | αm), with
m set to the uniform vector m_i = 1/I, and making a Zipf plot, that is, a ranked
plot of the values of the components p_i. It is traditional to plot both p_i (vertical
axis) and the rank (horizontal axis) on logarithmic scales so that power
law relationships appear as straight lines. Figure 23.9 shows these plots for a
single sample from ensembles with I = 100 and I = 1000 and with α from 0.1
to 1000. For large α, the plot is shallow with many components having similar
values. For small α, typically one component p_i receives an overwhelming
share of the probability, and of the small probability that remains to be shared
among the other components, another component p_{i′} receives a similarly large
share. In the limit as α goes to zero, the plot tends to an increasingly steep
power law.

Second, we can characterize the role of α in terms of the predictive distribution
that results when we observe samples from p and obtain counts
F = (F_1, F_2, ..., F_I) of the possible outcomes. The value of α defines the
number of samples from p that are required in order that the data dominate
over the prior in predictions.
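
The effect of α on the sharpness of a sample can be explored with a short simulation. The sketch below is an assumption-laden illustration rather than the procedure used for figure 23.9: it draws a Dirichlet sample by normalizing independent gamma variates (a standard construction), uses I = 100 with a uniform mean vector, and compares only two values of α. The function names and the seed are arbitrary.

```python
import random

def sample_dirichlet(alpha, m, rng):
    """Draw p ~ Dirichlet(p | alpha * m) by normalizing independent
    gamma variates with shapes alpha * m_i."""
    g = [rng.gammavariate(alpha * mi, 1.0) for mi in m]
    total = sum(g)
    return [gi / total for gi in g]

def zipf_ranking(p):
    """Components sorted from largest to smallest, as in a Zipf plot."""
    return sorted(p, reverse=True)

rng = random.Random(0)          # fixed seed so the run is repeatable
I = 100
m = [1.0 / I] * I               # uniform mean vector m_i = 1/I

ranked_sharp = zipf_ranking(sample_dirichlet(1000.0, m, rng))
ranked_broad = zipf_ranking(sample_dirichlet(10.0, m, rng))
```

Plotting `ranked_sharp` and `ranked_broad` against rank on logarithmic axes gives the kind of Zipf plot described above. For very small α the gamma variates can underflow to zero, so this simple construction would need more care in that regime.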
[Figure 23.9. Zipf plots for random samples from Dirichlet distributions with various values of α = 0.1 ... 1000. For each value of I = 100 or 1000 and each α, one sample p from the Dirichlet distribution was generated. The Zipf plot shows the probabilities p_i, ranked by magnitude, versus their rank.]

Exercise 23.3. The Dirichlet distribution satisfies a nice additivity property.
Imagine that a biased six-sided die has two red faces and four blue faces.
The die is rolled N times and two Bayesians examine the outcomes in
order to infer the bias of the die and make predictions. One Bayesian
has access to the red/blue colour outcomes only, and he infers a two-component
probability vector (p_R, p_B). The other Bayesian has access
to each full outcome: he can see which of the six faces came up, and
he infers a six-component probability vector (p1, p2, p3, p4, p5, p6), where
p_R = p1 + p2 and p_B = p3 + p4 + p5 + p6. Assuming that the second
Bayesian assigns a Dirichlet distribution to (p1, p2, p3, p4, p5, p6) with
hyperparameters (u1, u2, u3, u4, u5, u6), show that, in order for the first
Bayesian’s inferences to be consistent with those of the second Bayesian,
the first Bayesian’s prior should be a Dirichlet distribution with hyperparameters
((u1 + u2), (u3 + u4 + u5 + u6)).


Hint: a brute-force approach is to compute the integral P(p_R, p_B) =
∫ d⁶p P(p | u) δ(p_R − (p1 + p2)) δ(p_B − (p3 + p4 + p5 + p6)). A cheaper
approach is to compute the predictive distributions, given arbitrary data
(F1, F2, F3, F4, F5, F6), and find the condition for the two predictive distributions
to match for all data.





The entropic distribution for a probability vector p is sometimes used in 
the ‘maximum entropy’ image reconstruction community. 


P(p | \alpha, m) = \frac{1}{Z(\alpha, m)} \exp[-\alpha D_{KL}(p \| m)] \, \delta\left( \sum_i p_i - 1 \right),    (23.35)


where m, the measure, is a positive vector, and D_KL(p||m) = Σ_i p_i log p_i/m_i.


Further reading 


See (MacKay and Peto, 1995) for fun with Dirichlets. 


> 23.6 Further exercises 


Exercise 23.4. N datapoints {x_n} are drawn from a gamma distribution
P(x | s, c) = Γ(x; s, c) with unknown parameters s and c. What are the
maximum likelihood parameters s and c?

24


Exact Marginalization 


How can we avoid the exponentially large cost of complete enumeration of 
all hypotheses? Before we stoop to approximate methods, we explore two 
approaches to exact marginalization: first, marginalization over continuous 
variables (sometimes known as nuisance parameters) by doing integrals; and 
second, summation over discrete variables by message-passing. 

Exact marginalization over continuous parameters is a macho activity en- 
joyed by those who are fluent in definite integration. This chapter uses gamma 
distributions; as was explained in the previous chapter, gamma distributions 
are a lot like Gaussian distributions, except that whereas the Gaussian goes 
from −∞ to ∞, gamma distributions go from 0 to ∞.


> 24.1 Inferring the mean and variance of a Gaussian distribution 


We discuss again the one-dimensional Gaussian distribution, parameterized
by a mean μ and a standard deviation σ:


P(x | \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \equiv \mathrm{Normal}(x; \mu, \sigma^2).    (24.1)

When inferring these parameters, we must specify their prior distribution. 
The prior gives us the opportunity to include specific knowledge that we have 
about μ and σ (from independent experiments, or on theoretical grounds, for
example). If we have no such knowledge, then we can construct an appropriate 
prior that embodies our supposed ignorance. In section 21.2, we assumed a 
uniform prior over the range of parameters plotted. If we wish to be able to 
perform exact marginalizations, it may be useful to consider conjugate priors; 
these are priors whose functional form combines naturally with the likelihood 
such that the inferences have a convenient form. 


Conjugate priors for μ and σ


The conjugate prior for a mean μ is a Gaussian: we introduce two
‘hyperparameters’, μ0 and σ_μ, which parameterize the prior on μ, and write
P(μ | μ0, σ_μ) = Normal(μ; μ0, σ_μ²). In the limit μ0 = 0, σ_μ → ∞, we obtain
the noninformative prior for a location parameter, the flat prior. This is
noninformative because it is invariant under the natural reparameterization
μ′ = μ + c. The prior P(μ) = const. is also an improper prior, that is, it is not
normalizable.

The conjugate prior for a standard deviation σ is a gamma distribution,
which has two parameters b_β and c_β. It is most convenient to define the prior
density of the inverse variance (the precision parameter) β = 1/σ²:


P(\beta) = \Gamma(\beta; b_\beta, c_\beta) = \frac{1}{\Gamma(c_\beta)} \frac{\beta^{c_\beta - 1}}{b_\beta^{c_\beta}} \exp\left( -\frac{\beta}{b_\beta} \right), \quad 0 \le \beta < \infty.    (24.2)


This is a simple peaked distribution with mean b_β c_β and variance b_β² c_β. In
the limit b_β c_β = 1, c_β → 0, we obtain the noninformative prior for a scale
parameter, the 1/σ prior. This is ‘noninformative’ because it is invariant
under the reparameterization σ′ = cσ. The 1/σ prior is less strange-looking if
we examine the resulting density over ln σ, or ln β, which is flat. This is the
prior that expresses ignorance about σ by saying ‘well, it could be 10, or it
could be 1, or it could be 0.1, ...’ Scale variables such as σ are usually best
represented in terms of their logarithm. Again, this noninformative 1/σ prior
is improper.

[Margin note. Reminder: when we change variables from σ to l(σ), a one-to-one function of σ, the probability density transforms from P_σ(σ) to P_l(l) = P_σ(σ) |∂σ/∂l|. Here, the Jacobian is |∂σ/∂ ln σ| = σ.]

In the following examples, I will use the improper noninformative priors
for μ and σ. Using improper priors is viewed as distasteful in some circles,
so let me excuse myself by saying it’s for the sake of readability; if I included
proper priors, the calculations could still be done but the key points would be
obscured by the flood of extra parameters.


Maximum likelihood and marginalization: σ_N and σ_{N−1}


The task of inferring the mean and standard deviation of a Gaussian distribution
from N samples is a familiar one, though maybe not everyone understands
the difference between the σ_N and σ_{N−1} buttons on their calculator. Let us
recap the formulae, then derive them.


Given data D = {x_n}_{n=1}^{N}, an ‘estimator’ of μ is

\bar{x} \equiv \sum_{n=1}^{N} x_n / N,    (24.3)

and two estimators of σ are:

\sigma_N \equiv \sqrt{ \frac{\sum_{n=1}^{N} (x_n - \bar{x})^2}{N} } \quad \text{and} \quad \sigma_{N-1} \equiv \sqrt{ \frac{\sum_{n=1}^{N} (x_n - \bar{x})^2}{N-1} }.    (24.4)

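
The two ‘calculator buttons’ can be written out directly. The following sketch uses made-up data of N = 5 points; the function name `estimators` is illustrative. It also compares the results with Python's standard `statistics` module, whose `pstdev` divides by N and whose `stdev` divides by N − 1.

```python
import math
import statistics

def estimators(data):
    """Return (x_bar, sigma_N, sigma_N_minus_1) from equations (24.3)-(24.4)."""
    n = len(data)
    x_bar = sum(data) / n
    s = sum((x - x_bar) ** 2 for x in data)   # S, the sum of squared residuals
    return x_bar, math.sqrt(s / n), math.sqrt(s / (n - 1))

data = [0.5, 0.8, 1.0, 1.2, 1.5]              # hypothetical data, N = 5
x_bar, sigma_n, sigma_n_minus_1 = estimators(data)

# The standard library exposes the same two conventions:
# pstdev divides by N, stdev divides by N - 1.
library_sigma_n = statistics.pstdev(data)
library_sigma_n_minus_1 = statistics.stdev(data)
```

For any data set with more than one distinct value, σ_{N−1} exceeds σ_N by the factor √(N/(N−1)).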
There are two principal paradigms for statistics: sampling theory and Bayesian 
inference. In sampling theory (also known as ‘frequentist’ or orthodox statis- 
tics), one invents estimators of quantities of interest and then chooses between 
those estimators using some criterion measuring their sampling properties; 
there is no clear principle for deciding which criterion to use to measure the 
performance of an estimator; nor, for most criteria, is there any systematic 
procedure for the construction of optimal estimators. In Bayesian inference, 
in contrast, once we have made explicit all our assumptions about the model 
and the data, our inferences are mechanical. Whatever question we wish to 
pose, the rules of probability theory give a unique answer which consistently 
takes into account all the given information. Human-designed estimators and 
confidence intervals have no role in Bayesian inference; human input only en- 
ters into the important tasks of designing the hypothesis space (that is, the 
specification of the model and all its probability distributions), and figuring 
out how to do the computations that implement inference in that space. The 
answers to our questions are probability distributions over the quantities of 
interest. We often find that the estimators of sampling theory emerge auto- 
matically as modes or means of these posterior distributions when we choose 
a simple hypothesis space and turn the handle of Bayesian inference.

[Figure 24.1. The likelihood function for the parameters of a Gaussian distribution, repeated from figure 21.5. (a1, a2) Surface plot and contour plot of the log likelihood as a function of μ and σ. The data set of N = 5 points had mean x̄ = 1.0 and S = Σ(x − x̄)² = 1.0. Notice that the maximum is skew in σ. The two estimators of standard deviation have values σ_N = 0.45 and σ_{N−1} = 0.50. (c) The posterior probability of σ for various fixed values of μ (shown as a density over ln σ). (d) The posterior probability of σ, P(σ | D), assuming a flat prior on μ, obtained by projecting the probability mass in (a) onto the σ axis. The maximum of P(σ | D) is at σ_{N−1}. By contrast, the maximum of P(σ | D, μ = x̄) is at σ_N. (Both probabilities are shown as densities over ln σ.)]

In sampling theory, the estimators above can be motivated as follows. x̄ is
an unbiased estimator of μ which, out of all the possible unbiased estimators
of μ, has smallest variance (where this variance is computed by averaging over
an ensemble of imaginary experiments in which the data samples are assumed
to come from an unknown Gaussian distribution). The estimator (x̄, σ_N) is the
maximum likelihood estimator for (μ, σ). The estimator σ_N is biased, however:
the expectation of σ_N, given σ, averaging over many imagined experiments, is
not σ.


Exercise 24.1.[1, p.323] Give an intuitive explanation why the estimator σ_N is
biased.


This bias motivates the invention, in sampling theory, of σ_{N−1}, which can be
shown to be an unbiased estimator. Or to be precise, it is σ²_{N−1} that is an
unbiased estimator of σ².
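
The ‘ensemble of imaginary experiments’ can be imitated by simulation. The sketch below is an illustration under assumed settings (N = 5, true σ = 1, 20000 simulated data sets, an arbitrary seed); the function name is invented. It averages σ_N² and σ²_{N−1} over the ensemble, and one would expect values near (N − 1)/N = 0.8 and 1 respectively.

```python
import random

def average_variance_estimates(n, sigma, trials, rng):
    """Average sigma_N^2 and sigma_{N-1}^2 over many simulated data sets
    of n points drawn from Normal(0, sigma^2)."""
    sum_biased = 0.0
    sum_unbiased = 0.0
    for _ in range(trials):
        data = [rng.gauss(0.0, sigma) for _ in range(n)]
        x_bar = sum(data) / n
        s = sum((x - x_bar) ** 2 for x in data)
        sum_biased += s / n
        sum_unbiased += s / (n - 1)
    return sum_biased / trials, sum_unbiased / trials

rng = random.Random(1)           # fixed seed for repeatability
mean_sq_n, mean_sq_n_minus_1 = average_variance_estimates(
    n=5, sigma=1.0, trials=20000, rng=rng)
```

Because the averages come from a finite simulation, they match the expected values only to within sampling error.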

We now look at some Bayesian inferences for this problem, assuming noninformative
priors for μ and σ. The emphasis is thus not on the priors, but
rather on (a) the likelihood function, and (b) the concept of marginalization.
The joint posterior probability of μ and σ is proportional to the likelihood
function illustrated by a contour plot in figure 24.1a. The log likelihood is:


\ln P(\{x_n\}_{n=1}^{N} | \mu, \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \sum_n (x_n - \mu)^2 / (2\sigma^2),    (24.5)

 = -N \ln(\sqrt{2\pi}\,\sigma) - [N(\mu - \bar{x})^2 + S] / (2\sigma^2),    (24.6)


where S ≡ Σ_n (x_n − x̄)². Given the Gaussian model, the likelihood can be
expressed in terms of the two functions of the data x̄ and S, so these two
quantities are known as ‘sufficient statistics’. The posterior probability of μ
and σ is, using the improper priors:


P(\mu, \sigma | \{x_n\}_{n=1}^{N}) = \frac{P(\{x_n\}_{n=1}^{N} | \mu, \sigma) P(\mu, \sigma)}{P(\{x_n\}_{n=1}^{N})}    (24.7)

 = \frac{ \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{N(\mu - \bar{x})^2 + S}{2\sigma^2} \right) \frac{1}{\sigma_\mu} \frac{1}{\sigma} }{ P(\{x_n\}_{n=1}^{N}) }.    (24.8)

This function describes the answer to the question, ‘given the data, and the
noninformative priors, what might μ and σ be?’ It may be of interest to find
the parameter values that maximize the posterior probability, though it should
be emphasized that posterior probability maxima have no fundamental status
in Bayesian inference, since their location depends on the choice of basis. Here
we choose the basis (μ, ln σ), in which our prior is flat, so that the posterior
probability maximum coincides with the maximum of the likelihood. As we
saw in exercise 22.4 (p.302), the maximum likelihood solution for μ and ln σ
is {μ, σ}_ML = {x̄, σ_N = √(S/N)}.
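
The claim that x̄ and S are sufficient statistics, and the location of the likelihood maximum, can both be checked numerically. This is a minimal sketch with hypothetical data and invented function names: it evaluates the log likelihood once as a sum over data points (24.5) and once from N, x̄ and S alone (24.6), then compares the value at {x̄, √(S/N)} with a few nearby points.

```python
import math

def log_likelihood_direct(data, mu, sigma):
    """Equation (24.5): sum over data points."""
    n = len(data)
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - sum((x - mu) ** 2 for x in data) / (2 * sigma ** 2))

def log_likelihood_sufficient(n, x_bar, s, mu, sigma):
    """Equation (24.6): the same quantity from N, x_bar and S only."""
    return (-n * math.log(math.sqrt(2 * math.pi) * sigma)
            - (n * (mu - x_bar) ** 2 + s) / (2 * sigma ** 2))

data = [0.5, 0.8, 1.0, 1.2, 1.5]          # hypothetical data
n = len(data)
x_bar = sum(data) / n
s = sum((x - x_bar) ** 2 for x in data)

mu, sigma = 0.3, 0.7                       # arbitrary test point
direct = log_likelihood_direct(data, mu, sigma)
compact = log_likelihood_sufficient(n, x_bar, s, mu, sigma)

# The maximum likelihood point {x_bar, sqrt(S/N)} should beat nearby points.
sigma_ml = math.sqrt(s / n)
best = log_likelihood_sufficient(n, x_bar, s, x_bar, sigma_ml)
nearby = [log_likelihood_sufficient(n, x_bar, s, x_bar + dm, sigma_ml * ds)
          for dm in (-0.05, 0.0, 0.05) for ds in (0.95, 1.05)]
```

Comparing against a handful of neighbours is only a spot check, not a proof that the point is the global maximum.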

There is more to the posterior distribution than just its mode. As can
be seen in figure 24.1a, the likelihood has a skew peak. As we increase σ,
the width of the conditional distribution of μ increases (figure 22.1b). And
if we fix μ to a sequence of values moving away from the sample mean x̄, we
obtain a sequence of conditional distributions over σ whose maxima move to
increasing values of σ (figure 24.1c).

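The posterior probability of μ given σ is


P(\mu | \{x_n\}_{n=1}^{N}, \sigma) = \frac{P(\{x_n\}_{n=1}^{N} | \mu, \sigma) P(\mu)}{P(\{x_n\}_{n=1}^{N} | \sigma)}    (24.9)
 \propto \exp(-N(\mu - \bar{x})^2 / (2\sigma^2))    (24.10)
 = \mathrm{Normal}(\mu; \bar{x}, \sigma^2/N).    (24.11)


We note the familiar σ/√N scaling of the error bars on μ.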

Let us now ask the question ‘given the data, and the noninformative priors,
what might σ be?’ This question differs from the first one we asked in that we
are now not interested in μ. This parameter must therefore be marginalized
over. The posterior probability of σ is:


P(\sigma | \{x_n\}_{n=1}^{N}) = \frac{P(\{x_n\}_{n=1}^{N} | \sigma) P(\sigma)}{P(\{x_n\}_{n=1}^{N})}.    (24.12)


The data-dependent term P({x_n}_{n=1}^{N} | σ) appeared earlier as the normalizing
constant in equation (24.9); one name for this quantity is the ‘evidence’, or
marginal likelihood, for σ. We obtain the evidence for σ by integrating out
μ; a noninformative prior P(μ) = constant is assumed; we call this constant
1/σ_μ, so that we can think of the prior as a top-hat prior of width σ_μ. The
Gaussian integral, P({x_n}_{n=1}^{N} | σ) = ∫ P({x_n}_{n=1}^{N} | μ, σ) P(μ) dμ, yields:


\ln P(\{x_n\}_{n=1}^{N} | \sigma) = -N \ln(\sqrt{2\pi}\,\sigma) - \frac{S}{2\sigma^2} + \ln \frac{\sqrt{2\pi}\,\sigma / \sqrt{N}}{\sigma_\mu}.    (24.13)


The first two terms are the best-fit log likelihood (i.e., the log likelihood with
μ = x̄). The last term is the log of the Occam factor which penalizes smaller
values of σ. (We will discuss Occam factors more in Chapter 28.) When we
differentiate the log evidence with respect to ln σ, to find the most probable
σ, the additional volume factor (σ/√N) shifts the maximum from σ_N to


\sigma_{N-1} = \sqrt{S/(N-1)}.    (24.14)
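
The shift of the maximum from σ_N to σ_{N−1} can be seen by maximizing (24.13) numerically. The sketch below is an illustration under stated assumptions: it uses N = 5 and S = 1.0 (the values quoted for figure 24.1), an arbitrary prior width σ_μ = 100 (which only adds a constant), and a plain grid search over ln σ. The function names are invented here.

```python
import math

def best_fit_log_likelihood(sigma, n, s):
    """First two terms of (24.13): the log likelihood at mu = x_bar."""
    return -n * math.log(math.sqrt(2 * math.pi) * sigma) - s / (2 * sigma ** 2)

def log_evidence(sigma, n, s, sigma_mu=100.0):
    """Equation (24.13); sigma_mu is the width of the top-hat prior on mu
    (its value only shifts the curve by a constant)."""
    occam = math.log(math.sqrt(2 * math.pi) * sigma / math.sqrt(n) / sigma_mu)
    return best_fit_log_likelihood(sigma, n, s) + occam

def argmax_over_ln_sigma(f, lo=-3.0, hi=3.0, steps=60000):
    """Grid search over ln(sigma); returns the maximizing sigma."""
    best_sigma, best_val = None, -math.inf
    for i in range(steps + 1):
        sigma = math.exp(lo + (hi - lo) * i / steps)
        val = f(sigma)
        if val > best_val:
            best_sigma, best_val = sigma, val
    return best_sigma

n, s = 5, 1.0     # the values quoted for figure 24.1
sigma_from_evidence = argmax_over_ln_sigma(lambda sg: log_evidence(sg, n, s))
sigma_from_best_fit = argmax_over_ln_sigma(
    lambda sg: best_fit_log_likelihood(sg, n, s))
```

With these inputs the two maxima should land near √(S/(N−1)) = 0.5 and √(S/N) ≈ 0.447, up to the resolution of the grid.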


Intuitively, the denominator (N−1) counts the number of noise measurements
contained in the quantity S = Σ_n (x_n − x̄)². The sum contains N residuals
squared, but there are only (N−1) effective noise measurements because the
determination of one parameter μ from the data causes one dimension of noise
to be gobbled up in unavoidable overfitting. In the terminology of classical
statistics, the Bayesian’s best guess for σ sets χ² (the measure of deviance
defined by χ² ≡ Σ_n (x_n − μ̂)²/σ̂²) equal to the number of degrees of freedom,
N − 1.

Figure 24.1d shows the posterior probability of σ, which is proportional
to the marginal likelihood. This may be contrasted with the posterior probability
of σ with μ fixed to its most probable value, x̄ = 1, which is shown in
figure 24.1c and d.

The final inference we might wish to make is ‘given the data, what is μ?’


> Exercise 24.2. Marginalize over σ and obtain the posterior marginal distribution
of μ, which is a Student-t distribution:


P(\mu | D) \propto 1 / \left( N(\mu - \bar{x})^2 + S \right)^{N/2}.    (24.15)


Further reading 


A bible of exact marginalization is Bretthorst’s (1988) book on Bayesian spec- 
trum analysis and parameter estimation. 


> 24.2 Exercises 


> Exercise 24.3. [This exercise requires macho integration capabilities.] Give
a Bayesian solution to exercise 22.15 (p.309), where seven scientists of
varying capabilities have measured μ with personal noise levels σ_n,
and we are interested in inferring μ. Let the prior on each σ_n be a
broad prior, for example a gamma distribution with parameters (s, c) =
(10, 0.1). Find the posterior distribution of μ. Plot it, and explore its
properties for a variety of data sets such as the one given, and the data
set {x_n} = {13.01, 7.39}.


[Hint: first find the posterior distribution of σ_n given μ and x_n,
P(σ_n | x_n, μ). Note that the normalizing constant for this inference is
P(x_n | μ). Marginalize over σ_n to find this normalizing constant, then
use Bayes’ theorem a second time to find P(μ | {x_n}).]


> 24.3 Solutions 


Solution to exercise 24.1 (p.321). 1. The data points are distributed with mean
squared deviation σ² about the true mean. 2. The sample mean is unlikely
to exactly equal the true mean. 3. The sample mean is the value of μ that
minimizes the sum squared deviation of the data points from μ. Any other
value of μ (in particular, the true value of μ) will have a larger value of the
sum-squared deviation than μ = x̄.

So the expected mean squared deviation from the sample mean is necessarily
smaller than the mean squared deviation σ² about the true mean.

25


Exact Marginalization in Trellises


In this chapter we will discuss a few exact methods that are used in proba- 
bilistic modelling. As an example we will discuss the task of decoding a linear 
error-correcting code. We will see that inferences can be conducted most effi- 
ciently by message-passing algorithms, which take advantage of the graphical 
structure of the problem to avoid unnecessary duplication of computations 
(see Chapter 16). 


> 25.1 Decoding problems


A codeword t is selected from a linear (N, K) code C, and it is transmitted 
over a noisy channel; the received signal is y. In this chapter we will assume 
that the channel is a memoryless channel such as a Gaussian channel. Given 
an assumed channel model P(y |t), there are two decoding problems. 


The codeword decoding problem is the task of inferring which codeword 
t was transmitted given the received signal. 


The bitwise decoding problem is the task of inferring for each transmitted
bit t_n how likely it is that that bit was a one rather than a zero.


As a concrete example, take the (7,4) Hamming code. In Chapter 1, we 
discussed the codeword decoding problem for that code, assuming a binary 
symmetric channel. We didn’t discuss the bitwise decoding problem and we 
didn’t discuss how to handle more general channel models such as a Gaussian 
channel. 


Solving the codeword decoding problem 


By Bayes’ theorem, the posterior probability of the codeword t is 


P(t | y) = \frac{P(y | t) P(t)}{P(y)}.    (25.1)


Likelihood function. The first factor in the numerator, P(y | t), is the likelihood
of the codeword, which, for any memoryless channel, is a separable
function,


P(y | t) = \prod_{n=1}^{N} P(y_n | t_n).    (25.2)


For example, if the channel is a Gaussian channel with transmissions ±x
and additive noise of standard deviation σ, then the probability density
of the received signal y_n in the two cases t_n = 0, 1 is


P(y_n | t_n = 1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n - x)^2}{2\sigma^2} \right)    (25.3)

P(y_n | t_n = 0) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_n + x)^2}{2\sigma^2} \right).    (25.4)


From the point of view of decoding, all that matters is the likelihood
ratio, which for the case of the Gaussian channel is


\frac{P(y_n | t_n = 1)}{P(y_n | t_n = 0)} = \exp\left( \frac{2 x y_n}{\sigma^2} \right).    (25.5)





Exercise 25.1. Show that from the point of view of decoding, a Gaussian
channel is equivalent to a time-varying binary symmetric channel with
a known noise level f_n which depends on n.


Prior. The second factor in the numerator is the prior probability of the 
codeword, P(t), which is usually assumed to be uniform over all valid 
codewords. 


The denominator in (25.1) is the normalizing constant 


P(y) = \sum_{t} P(y | t) P(t).    (25.6)
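
For a very small code the posterior (25.1) can simply be enumerated. The sketch below is an illustration, not an efficient decoder: it assumes the simple parity code P3 (all length-3 words with an even number of ones), a uniform prior over codewords, a Gaussian channel as in (25.3) and (25.4), and a made-up received signal and channel parameters. The function names are invented here.

```python
import math
from itertools import product

def gaussian_likelihood(y_n, t_n, x, sigma):
    """P(y_n | t_n) for a Gaussian channel, equations (25.3)-(25.4):
    bit 1 is sent as +x and bit 0 as -x."""
    mean = x if t_n == 1 else -x
    return (math.exp(-(y_n - mean) ** 2 / (2 * sigma ** 2))
            / math.sqrt(2 * math.pi * sigma ** 2))

def codeword_posterior(codewords, y, x, sigma):
    """Equation (25.1) with a uniform prior over the listed codewords."""
    prior = 1.0 / len(codewords)
    joint = {}
    for t in codewords:
        likelihood = 1.0
        for y_n, t_n in zip(y, t):                 # separable likelihood (25.2)
            likelihood *= gaussian_likelihood(y_n, t_n, x, sigma)
        joint[t] = likelihood * prior
    p_y = sum(joint.values())                      # normalizing constant (25.6)
    return {t: v / p_y for t, v in joint.items()}

# Simple parity code P3: all length-3 words with an even number of ones.
codewords = [t for t in product((0, 1), repeat=3) if sum(t) % 2 == 0]

y = (0.9, -0.2, 1.3)          # hypothetical received signal
x, sigma = 1.0, 1.0           # hypothetical channel parameters
posterior = codeword_posterior(codewords, y, x, sigma)
map_codeword = max(posterior, key=posterior.get)
```

The loop visits every codeword, so its cost grows with the size of the code; it is meant only to make the three factors in Bayes' theorem concrete.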


The complete solution to the codeword decoding problem is a list of all 
codewords and their probabilities as given by equation (25.1). Since the number
of codewords in a linear code, 2^K, is often very large, and since we are not
interested in knowing the detailed probabilities of all the codewords, we often 
restrict attention to a simplified version of the codeword decoding problem. 


The MAP codeword decoding problem is the task of identifying the 
most probable codeword t given the received signal. 


If the prior probability over codewords is uniform then this task is iden- 
tical to the problem of maximum likelihood decoding, that is, identifying 
the codeword that maximizes P(y |t). 


Example: In Chapter 1, for the (7,4) Hamming code and a binary symmetric 
channel we discussed a method for deducing the most probable codeword from 
the syndrome of the received signal, thus solving the MAP codeword decoding 
problem for that case. We would like a more general solution. 

The MAP codeword decoding problem can be solved in exponential time 
(of order 2^K) by searching through all codewords for the one that maximizes
P(y|t)P(t). But we are interested in methods that are more efficient than 
this. In section 25.3, we will discuss an exact method known as the min—sum 
algorithm which may be able to solve the codeword decoding problem more 
efficiently; how much more efficiently depends on the properties of the code. 

It is worth emphasizing that MAP codeword decoding for a general lin- 
ear code is known to be NP-complete (which means in layman’s terms that 
MAP codeword decoding has a complexity that scales exponentially with the 
blocklength, unless there is a revolution in computer science). So restrict- 
ing attention to the MAP decoding problem hasn’t necessarily made the task 
much less challenging; it simply makes the answer briefer to report.

Solving the bitwise decoding problem


Formally, the exact solution of the bitwise decoding problem is obtained from 
equation (25.1) by marginalizing over the other bits. 


P(t_n | y) = \sum_{\{t_{n'} :\, n' \neq n\}} P(t | y).    (25.7)


We can also write this marginal with the aid of a truth function 1[S] that is
one if the proposition S is true and zero otherwise.


P(t_n = 1 | y) = \sum_{t} P(t | y) \, \mathbb{1}[t_n = 1]    (25.8)

P(t_n = 0 | y) = \sum_{t} P(t | y) \, \mathbb{1}[t_n = 0].    (25.9)






Computing these marginal probabilities by an explicit sum over all codewords 
t takes exponential time. But, for certain codes, the bitwise decoding problem 
can be solved much more efficiently using the forward—backward algorithm. 
We will describe this algorithm, which is an example of the sum—product 
algorithm, in a moment. Both the min-sum algorithm and the sum—product 
algorithm have widespread importance, and have been invented many times 
in many fields. 
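
Before turning to efficient algorithms, it may help to see the explicit sum of (25.8) written out. The sketch below is the brute-force method, not the forward-backward algorithm: it assumes the simple parity code P3, a uniform prior over codewords, a Gaussian channel, and a made-up received signal. It uses the fact that, up to a factor common to all codewords, the likelihood of a codeword is the product of the likelihood ratios (25.5) of the bits set to one.

```python
import math
from itertools import product

def bitwise_posteriors(codewords, y, x, sigma):
    """P(t_n = 1 | y) for every n, by explicit summation over codewords
    (equations (25.8)-(25.9)), assuming a uniform prior over codewords."""
    weights = {}
    for t in codewords:
        # Log of the product of likelihood ratios (25.5) over the one-bits.
        log_w = sum(2.0 * x * y_n / sigma ** 2
                    for y_n, t_n in zip(y, t) if t_n == 1)
        weights[t] = math.exp(log_w)
    total = sum(weights.values())
    n_bits = len(y)
    return [sum(w for t, w in weights.items() if t[n] == 1) / total
            for n in range(n_bits)]

codewords = [t for t in product((0, 1), repeat=3) if sum(t) % 2 == 0]
y = (0.9, -0.2, 1.3)          # hypothetical received signal
marginals = bitwise_posteriors(codewords, y, x=1.0, sigma=1.0)
```

The sum runs over every codeword, which is the exponential cost that the algorithms mentioned above are designed to avoid.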


> 25.2 Codes and trellises


In Chapters 1 and 11, we represented linear (N, K) codes in terms of their
generator matrices and their parity-check matrices. In the case of a systematic
block code, the first K transmitted bits in each block of size N are the source
bits, and the remaining M = N - K bits are the parity-check bits. This means
that the generator matrix of the code can be written

$$\mathbf{G}^{\mathsf{T}} = \begin{bmatrix} \mathbf{I}_K \\ \mathbf{P} \end{bmatrix} \tag{25.10}$$

and the parity-check matrix can be written

$$\mathbf{H} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_M \end{bmatrix}, \tag{25.11}$$

where P is an M x K matrix.

[Figure 25.1. Examples of trellises: (a) repetition code R3; (b) simple parity code P3; (c) (7,4) Hamming code.]

In this section we will study another representation of a linear code called a
trellis. The codes that these trellises represent will not in general be systematic
codes, but they can be mapped onto systematic codes if desired by a reordering
of the bits in a block.


Definition of a trellis 


Our definition will be quite narrow. For a more comprehensive view of trellises, 
the reader should consult Kschischang and Sorokine (1995). 


A trellis is a graph consisting of nodes (also known as states or vertices) and 
edges. The nodes are grouped into vertical slices called times, and the 
times are ordered such that each edge connects a node in one time to 
a node in a neighbouring time. Every edge is labelled with a symbol. 
The leftmost and rightmost states contain only one node. Apart from 
these two extreme nodes, all nodes in the trellis have at least one edge 
connecting leftwards and at least one connecting rightwards. 


[Figure 25.1, margin note: each edge in a trellis is labelled by a zero (shown by a square) or a one (shown by a cross).]

A trellis with N+1 times defines a code of blocklength N as follows: a
codeword is obtained by taking a path that crosses the trellis from left to right
and reading out the symbols on the edges that are traversed. Each valid path
through the trellis defines a codeword. We will number the leftmost time ‘time
0’ and the rightmost ‘time N’. We will number the leftmost state ‘state 0’
and the rightmost ‘state I’, where I is the total number of states (vertices) in
the trellis. The nth bit of the codeword is emitted as we move from time n-1
to time n.

The width of the trellis at a given time is the number of nodes in that 
time. The maximal width of a trellis is what it sounds like. 

A trellis is called a linear trellis if the code it defines is a linear code. We will 
solely be concerned with linear trellises from now on, as nonlinear trellises are 
much more complex beasts. For brevity, we will only discuss binary trellises, 
that is, trellises whose edges are labelled with zeroes and ones. It is not hard 
to generalize the methods that follow to q-ary trellises. 

Figures 25.1(a-c) show the trellises corresponding to the repetition code 
R3 which has (N, K) = (3,1); the parity code P3 with (N, K) = (3,2); and 
the (7,4) Hamming code. 


> Exercise 25.2. [2] Confirm that the sixteen codewords listed in table 1.14 are
generated by the trellis shown in figure 25.1c.


Observations about linear trellises 


For any linear code the minimal trellis is the one that has the smallest number 
of nodes. In a minimal trellis, each node has at most two edges entering it and 
at most two edges leaving it. All nodes in a time have the same left degree as 
each other and they have the same right degree as each other. The width is 
always a power of two. 
A minimal trellis for a linear (N, K) code cannot have a width greater than
$2^K$, since every node has at least one valid codeword through it, and there are
only $2^K$ codewords. Furthermore, if we define M = N - K, the minimal
trellis’s width is everywhere at most $2^M$. This will be proved in section 25.4.

Notice that for the linear trellises in figure 25.1, all of which are minimal
trellises, K is the number of times a binary branch point is encountered as the
trellis is traversed from left to right or from right to left.

We will discuss the construction of trellises more in section 25.4. But we
now know enough to discuss the decoding problem.


> 25.3 Solving the decoding problems on a trellis 


We can view the trellis of a linear code as giving a causal description of the 
probabilistic process that gives rise to a codeword, with time flowing from left 
to right. Each time a divergence is encountered, a random source (the source 
of information bits for communication) determines which way we go. 

At the receiving end, we receive a noisy version of the sequence of edge- 
labels, and wish to infer which path was taken, or to be precise, (a) we want 
to identify the most probable path in order to solve the codeword decoding 
problem; and (b) we want to find the probability that the transmitted symbol 
at time n was a zero or a one, to solve the bitwise decoding problem. 


Example 25.3. Consider the case of a single transmission from the Hamming
(7,4) trellis shown in figure 25.1c.

Figure 25.2. Posterior probabilities over the sixteen codewords when the
received vector y has normalized likelihoods (0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3).

    t         Likelihood   Posterior probability
    0000000   0.0275562    0.25
    0001011   0.0001458    0.0013
    0010111   0.0013122    0.012
    0011100   0.0030618    0.027
    0100110   0.0002268    0.0020
    0101101   0.0000972    0.0009
    0110001   0.0708588    0.63
    0111010   0.0020412    0.018
    1000101   0.0001458    0.0013
    1001110   0.0000042    0.0000
    1010010   0.0030618    0.027
    1011001   0.0013122    0.012
    1100011   0.0000972    0.0009
    1101000   0.0002268    0.0020
    1110100   0.0020412    0.018
    1111111   0.0000108    0.0001


Let the normalized likelihoods be: (0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3). That is,
the ratios of the likelihoods are

$$\frac{P(y_1 \mid x_1 = 1)}{P(y_1 \mid x_1 = 0)} = \frac{0.1}{0.9}, \quad \frac{P(y_2 \mid x_2 = 1)}{P(y_2 \mid x_2 = 0)} = \frac{0.4}{0.6}, \quad \text{etc.} \tag{25.12}$$

How should this received signal be decoded? 


1. If we threshold the likelihoods at 0.5 to turn the signal into a binary
   received vector, we have r = (0,0,1,0,0,0,0), which decodes,
   using the decoder for the binary symmetric channel (Chapter 1), into
   t = (0,0,0,0,0,0,0).

   This is not the optimal decoding procedure. Optimal inferences are
   always obtained by using Bayes’ theorem.

2. We can find the posterior probability over codewords by explicit
   enumeration of all sixteen codewords. This posterior distribution is shown
   in figure 25.2. Of course, we aren’t really interested in such brute-force
   solutions, and the aim of this chapter is to understand algorithms for
   getting the same information out in less than $2^K$ computer time.

   Examining the posterior probabilities, we notice that the most probable
   codeword is actually the string t = 0110001. This is more than twice as
   probable as the answer found by thresholding, 0000000.

   Using the posterior probabilities shown in figure 25.2, we can also
   compute the posterior marginal distributions of each of the bits. The result
   is shown in figure 25.3. Notice that bits 1, 4, 5 and 6 are all quite
   confidently inferred to be zero. The strengths of the posterior probabilities
   for bits 2, 3, and 7 are not so great.
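The brute-force enumeration described in this example can be sketched in a few lines of Python. This is only an illustration: the generator rows below are an assumed systematic generator for the (7,4) Hamming code (all four rows appear among the sixteen codewords of figure 25.2), and the function names are hypothetical.

```python
from itertools import product

# Assumed generator rows for the (7,4) Hamming code (each is a codeword
# listed in figure 25.2).
G = [
    (1, 0, 0, 0, 1, 0, 1),
    (0, 1, 0, 0, 1, 1, 0),
    (0, 0, 1, 0, 1, 1, 1),
    (0, 0, 0, 1, 0, 1, 1),
]

# Normalized likelihoods P(y_n | t_n = 1); P(y_n | t_n = 0) is one minus this.
like1 = [0.1, 0.4, 0.9, 0.1, 0.1, 0.1, 0.3]


def codewords(G):
    """Enumerate all 2^K codewords as mod-2 combinations of the rows of G."""
    n = len(G[0])
    for s in product((0, 1), repeat=len(G)):
        yield tuple(sum(s[k] * G[k][i] for k in range(len(G))) % 2 for i in range(n))


def likelihood(t, like1):
    """P(y | t) as a product of bitwise likelihoods."""
    p = 1.0
    for bit, l1 in zip(t, like1):
        p *= l1 if bit else 1.0 - l1
    return p


def decode_by_enumeration(G, like1):
    """Return the posterior over codewords and the marginals P(t_n = 1 | y)."""
    weights = {t: likelihood(t, like1) for t in codewords(G)}
    Z = sum(weights.values())  # the uniform prior over codewords cancels
    posterior = {t: w / Z for t, w in weights.items()}
    marginals = [
        sum(p for t, p in posterior.items() if t[n] == 1)
        for n in range(len(like1))
    ]
    return posterior, marginals


posterior, marginals = decode_by_enumeration(G, like1)
map_codeword = max(posterior, key=posterior.get)
print(map_codeword, posterior[map_codeword])
print(marginals)
```

The cost of this approach grows as $2^K$, which is exactly what the trellis algorithms of this chapter aim to avoid.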





In the above example, the MAP codeword is in agreement with the bitwise
decoding that is obtained by selecting the most probable state for each bit
using the posterior marginal distributions. But this is not always the case, as
the following exercise shows.
Figure 25.3. Marginal posterior probabilities for the 7 bits under the
posterior distribution of figure 25.2.

    n   Likelihood                         Posterior marginals
        P(y_n | t_n=1)   P(y_n | t_n=0)    P(t_n=1 | y)   P(t_n=0 | y)
    1   0.1              0.9               0.061          0.939
    2   0.4              0.6               0.674          0.326
    3   0.9              0.1               0.746          0.254
    4   0.1              0.9               0.061          0.939
    5   0.1              0.9               0.061          0.939
    6   0.1              0.9               0.061          0.939
    7   0.3              0.7               0.659          0.341


Exercise 25.4. [p.333] Find the most probable codeword in the case where
the normalized likelihood is (0.2, 0.2, 0.9, 0.2, 0.2, 0.2, 0.2). Also find or
estimate the marginal posterior probability for each of the seven bits,
and give the bit-by-bit decoding.

[Hint: concentrate on the few codewords that have the largest probability.]


We now discuss how to use message passing on a code’s trellis to solve the 
decoding problems. 


The min-sum algorithm 


The MAP codeword decoding problem can be solved using the min-sum al- 
gorithm that was introduced in section 16.3. Each codeword of the code 
corresponds to a path across the trellis. Just as the cost of a journey is the 
sum of the costs of its constituent steps, the log likelihood of a codeword is 
the sum of the bitwise log likelihoods. By convention, we flip the sign of the 
log likelihood (which we would like to maximize) and talk in terms of a cost, 
which we would like to minimize. 

We associate with each edge a cost $-\log P(y_n \mid t_n)$, where $t_n$ is the
transmitted bit associated with that edge, and $y_n$ is the received symbol. The
min-sum algorithm presented in section 16.3 can then identify the most probable
codeword in a number of computer operations equal to the number of
edges in the trellis. This algorithm is also known as the Viterbi algorithm
(Viterbi, 1967).
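As a rough illustration of the min-sum idea, here is a small Python sketch run on the trellis of the simple parity code P3. The trellis layout (state = running parity of the bits emitted so far), the edge-list representation, and the function name are assumptions made for this example; the likelihood values are the ones given in table 25.4.

```python
import math

# Assumed trellis for P3: one list of edges per transmitted bit, each edge
# being (from_state, to_state, bit). The state is the running parity.
trellis = [
    [(0, 0, 0), (0, 1, 1)],                        # bit 1
    [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)],  # bit 2
    [(0, 0, 0), (1, 0, 1)],                        # bit 3 restores even parity
]

# Bitwise likelihoods (P(y_n | t_n = 0), P(y_n | t_n = 1)).
likelihoods = [(1 / 4, 1 / 2), (1 / 2, 1 / 4), (1 / 8, 1 / 2)]


def min_sum(trellis, likelihoods, start=0, end=0):
    """Return the cheapest path's bits and its cost (-log2 likelihood)."""
    cost = {start: 0.0}  # cost of the best route to each node at this time
    back = []            # back[n][node] = (previous node, bit) on that route
    for edges, lik in zip(trellis, likelihoods):
        new_cost, pointers = {}, {}
        for j, i, bit in edges:
            if j not in cost:
                continue
            c = cost[j] - math.log2(lik[bit])  # add the edge cost
            if i not in new_cost or c < new_cost[i]:
                new_cost[i], pointers[i] = c, (j, bit)
        cost = new_cost
        back.append(pointers)
    # Trace the best route back from the end node.
    state, bits = end, []
    for pointers in reversed(back):
        state, bit = pointers[state]
        bits.append(bit)
    return bits[::-1], cost[end]


codeword, best_cost = min_sum(trellis, likelihoods)
print(codeword, best_cost)
```

Each edge is visited once in the forward sweep, which matches the operation count quoted above.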


The sum-product algorithm 


To solve the bitwise decoding problem, we can make a small modification to
the min-sum algorithm, so that the messages passed through the trellis define
‘the probability of the data up to the current point’ instead of ‘the cost of the
best route to this point’. We replace the costs on the edges, $-\log P(y_n \mid t_n)$, by
the likelihoods themselves, $P(y_n \mid t_n)$. We replace the min and sum operations
of the min-sum algorithm by a sum and product respectively.

Let $i$ run over nodes/states, $i = 0$ be the label for the start state, $\mathcal{P}(i)$
denote the set of states that are parents of state $i$, and $w_{ij}$ be the likelihood
associated with the edge from node $j$ to node $i$. We define the forward-pass
messages $\alpha_i$ by

$$\alpha_0 = 1, \qquad \alpha_i = \sum_{j \in \mathcal{P}(i)} w_{ij} \alpha_j. \tag{25.13}$$

These messages can be computed sequentially from left to right.
> Exercise 25.5. Show that for a node $i$ whose time-coordinate is $n$, $\alpha_i$ is
proportional to the joint probability that the codeword’s path passed
through node $i$ and that the first $n$ received symbols were $y_1, \ldots, y_n$.

The message $\alpha_I$ computed at the end node of the trellis is proportional to the
marginal probability of the data.

> Exercise 25.6. What is the constant of proportionality? [Answer: $2^K$]


We define a second set of backward-pass messages $\beta_i$ in a similar manner.
Let node $I$ be the end node.

$$\beta_I = 1, \qquad \beta_j = \sum_{i :\, j \in \mathcal{P}(i)} w_{ij} \beta_i. \tag{25.14}$$

These messages can be computed sequentially in a backward pass from right
to left.

> Exercise 25.7. Show that for a node $i$ whose time-coordinate is $n$, $\beta_i$ is
proportional to the conditional probability, given that the codeword’s
path passed through node $i$, that the subsequent received symbols were
$y_{n+1}, \ldots, y_N$.

Finally, to find the probability that the nth bit was a 1 or 0, we do two
summations of products of the forward and backward messages. Let $i$ run over
nodes at time $n$ and $j$ run over nodes at time $n-1$, and let $t_{ij}$ be the value
of $t$ associated with the trellis edge from node $j$ to node $i$. For each value of
$t = 0/1$, we compute

$$r_n^{(t)} = \sum_{i,j :\, j \in \mathcal{P}(i),\, t_{ij} = t} \alpha_j w_{ij} \beta_i. \tag{25.15}$$

Then the posterior probability that $t_n$ was $t = 0/1$ is

$$P(t_n = t \mid \mathbf{y}) = \frac{1}{Z} r_n^{(t)}, \tag{25.16}$$

where the normalizing constant $Z = r_n^{(0)} + r_n^{(1)}$ should be identical to the final
forward message $\alpha_I$ that was computed earlier.

Exercise 25.8. Confirm that the above sum-product algorithm does
compute $P(t_n = t \mid \mathbf{y})$.

Other names for the sum-product algorithm presented here are ‘the
forward-backward algorithm’, ‘the BCJR algorithm’, and ‘belief propagation’.

> Exercise 25.9. [p.333] A codeword of the simple parity code P3 is transmitted,
and the received signal y has associated likelihoods shown in table 25.4.
Use the min-sum algorithm and the sum-product algorithm in the trellis
(figure 25.1) to solve the MAP codeword decoding problem and the
bitwise decoding problem. Confirm your answers by enumeration of
all codewords (000, 011, 110, 101). [Hint: use logs to base 2 and do
the min-sum computations by hand. When working the sum-product
algorithm by hand, you may find it helpful to use three colours of pen,
one for the $\alpha$s, one for the $w$s, and one for the $\beta$s.]

Table 25.4. Bitwise likelihoods for a codeword of P3.

    n   P(y_n | t_n = 0)   P(y_n | t_n = 1)
    1   1/4                1/2
    2   1/2                1/4
    3   1/8                1/2
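The forward pass, backward pass and bitwise read-out described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the trellis layout for P3 (state = running parity), the edge-list representation and the function name are assumptions, and exact fractions are used so the messages can be compared with hand calculations.

```python
from fractions import Fraction as F

# Assumed trellis for P3: edges are (from_state, to_state, bit), one list
# per transmitted bit; the state is the running parity.
trellis = [
    [(0, 0, 0), (0, 1, 1)],
    [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)],
    [(0, 0, 0), (1, 0, 1)],
]

# Bitwise likelihoods (P(y_n | t_n = 0), P(y_n | t_n = 1)) from table 25.4.
likelihoods = [(F(1, 4), F(1, 2)), (F(1, 2), F(1, 4)), (F(1, 8), F(1, 2))]


def forward_backward(trellis, likelihoods, start=0, end=0):
    N = len(trellis)
    # Forward pass: alpha[n][i] for each node i at time n.
    alpha = [{start: F(1)}]
    for edges, lik in zip(trellis, likelihoods):
        a = {}
        for j, i, bit in edges:
            a[i] = a.get(i, 0) + lik[bit] * alpha[-1].get(j, 0)
        alpha.append(a)
    # Backward pass: beta[n][j] for each node j at time n.
    beta = [None] * (N + 1)
    beta[N] = {end: F(1)}
    for n in range(N - 1, -1, -1):
        b = {}
        for j, i, bit in trellis[n]:
            b[j] = b.get(j, 0) + likelihoods[n][bit] * beta[n + 1].get(i, 0)
        beta[n] = b
    Z = alpha[N][end]
    # Read-out: sum alpha_j * w_ij * beta_i over edges carrying each bit value.
    marginals = []
    for n in range(N):
        r = [F(0), F(0)]
        for j, i, bit in trellis[n]:
            r[bit] += alpha[n].get(j, 0) * likelihoods[n][bit] * beta[n + 1].get(i, 0)
        marginals.append(r[1] / (r[0] + r[1]))  # P(t_n = 1 | y)
    return alpha, beta, Z, marginals


alpha, beta, Z, marginals = forward_backward(trellis, likelihoods)
print(Z, marginals)
```

Both passes touch each edge once, so the cost is again proportional to the number of edges in the trellis.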
> 25.4 More on trellises


We now discuss various ways of making the trellis of a code. You may safely 
jump over this section. 

The span of a codeword is the set of bits contained between the first bit in 
the codeword that is non-zero, and the last bit that is non-zero, inclusive. We 
can indicate the span of a codeword by a binary vector as shown in table 25.5. 


Table 25.5. Some codewords and their spans.

    Codeword   0000000   0001011   0100110   1100011   0101101
    Span       0000000   0001111   0111110   1111111   0111111


A generator matrix is in trellis-oriented form if the spans of the rows of the 
generator matrix all start in different columns and the spans all end in different 
columns. 
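The two definitions above translate directly into code. The sketch below is illustrative only: codewords are represented as strings of '0' and '1', and the function names are hypothetical.

```python
def span(codeword):
    """Binary indicator of the bits between the first and last 1, inclusive."""
    ones = [i for i, c in enumerate(codeword) if c == "1"]
    if not ones:
        return "0" * len(codeword)  # the all-zero codeword has an empty span
    first, last = ones[0], ones[-1]
    return "0" * first + "1" * (last - first + 1) + "0" * (len(codeword) - last - 1)


def is_trellis_oriented(rows):
    """True if all row spans start in different columns and end in different columns."""
    starts = [r.index("1") for r in rows]
    ends = [r.rindex("1") for r in rows]
    return len(set(starts)) == len(rows) and len(set(ends)) == len(rows)


for t in ["0000000", "0001011", "0100110", "1100011", "0101101"]:
    print(t, span(t))

# Rows of a systematic generator for the (7,4) Hamming code: three of the
# spans end in the last column, so this is not trellis-oriented.
print(is_trellis_oriented(["1000101", "0100110", "0010111", "0001011"]))
```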


How to make a trellis from a generator matrix 


First, put the generator matrix into trellis-oriented form by row-manipulations 
similar to Gaussian elimination. For example, our (7,4) Hamming code can 
be generated by 


$$\mathbf{G} = \begin{bmatrix} 1&0&0&0&1&0&1 \\ 0&1&0&0&1&1&0 \\ 0&0&1&0&1&1&1 \\ 0&0&0&1&0&1&1 \end{bmatrix} \tag{25.17}$$


but this matrix is not in trellis-oriented form — for example, rows 1, 3 and 4 
all have spans that end in the same column. By subtracting lower rows from 
upper rows, we can obtain an equivalent generator matrix (that is, one that 
generates the same set of codewords) as follows: 


$$\mathbf{G} = \begin{bmatrix} 1&1&0&1&0&0&0 \\ 0&1&0&0&1&1&0 \\ 0&0&1&1&1&0&0 \\ 0&0&0&1&0&1&1 \end{bmatrix} \tag{25.18}$$
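That the row manipulations leave the code unchanged can be checked by enumeration. In this sketch the second matrix is obtained from the systematic one by adding rows 2 and 4 to row 1 and adding row 4 to row 3 (mod 2); the helper name `code` is hypothetical.

```python
from itertools import product


def code(rows):
    """Set of all codewords generated by the given rows (strings of 0/1)."""
    K, N = len(rows), len(rows[0])
    return {
        "".join(
            str(sum(s[k] * int(rows[k][n]) for k in range(K)) % 2)
            for n in range(N)
        )
        for s in product((0, 1), repeat=K)
    }


G_systematic = ["1000101", "0100110", "0010111", "0001011"]
G_oriented = ["1101000", "0100110", "0011100", "0001011"]

same_code = code(G_systematic) == code(G_oriented)
starts = [r.index("1") + 1 for r in G_oriented]  # 1-based first non-zero column
ends = [r.rindex("1") + 1 for r in G_oriented]   # 1-based last non-zero column
print(same_code, starts, ends)
```

The spans of the second matrix start in columns 1, 2, 3, 4 and end in columns 4, 6, 5, 7, so all starts differ and all ends differ.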


Now, each row of the generator matrix can be thought of as defining an 
(N,1) subcode of the (N, K) code, that is, in this case, a code with two 
codewords of length N = 7. For the first row, the code consists of the two 
codewords 1101000 and 0000000. The subcode defined by the second row 
consists of 0100110 and 0000000. It is easy to construct the minimal trellises 
of these subcodes; they are shown in the left column of figure 25.6. 

We build the trellis incrementally as shown in figure 25.6. We start with 
the trellis corresponding to the subcode given by the first row of the generator 
matrix. Then we add in one subcode at a time. The vertices within the span 
of the new subcode are all duplicated. The edge symbols in the original trellis 
are left unchanged and the edge symbols in the second part of the trellis are 
flipped wherever the new subcode has a 1 and otherwise left alone. 

Another (7,4) Hamming code can be generated by 


$$\mathbf{G} = \begin{bmatrix} 1&1&1&0&0&0&0 \\ 0&1&1&1&1&0&0 \\ 0&0&1&0&1&1&0 \\ 0&0&0&1&1&1&1 \end{bmatrix} \tag{25.19}$$
The (7,4) Hamming code generated by this matrix differs by a permutation 
of its bits from the code generated by the systematic matrix used in Chapter 
1 and above. The parity-check matrix corresponding to this permutation is: 


$$\mathbf{H} = \begin{bmatrix} 1&0&1&0&1&0&1 \\ 0&1&1&0&0&1&1 \\ 0&0&0&1&1&1&1 \end{bmatrix} \tag{25.20}$$

The trellis obtained from the permuted matrix G given in equation (25.19) is 
shown in figure 25.7a. Notice that the number of nodes in this trellis is smaller 
than the number of nodes in the previous trellis for the Hamming (7,4) code 


in figure 25.1c. We thus observe that rearranging the order of the codeword 
bits can sometimes lead to smaller, simpler trellises. 
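The effect of the bit ordering on trellis size can be estimated from the spans alone. The sketch below assumes, following the incremental construction described earlier, that each row of a trellis-oriented generator matrix doubles the number of nodes at every time strictly inside its span; the function name and the row strings are illustrative.

```python
def width_profile(rows):
    """Node counts at times 0..N for a trellis-oriented generator matrix.

    Assumption: a row whose span covers bits a..b (1-based) doubles the
    number of nodes at times a, a+1, ..., b-1.
    """
    N = len(rows[0])
    widths = []
    for n in range(N + 1):
        active = sum(1 for r in rows if r.index("1") < n <= r.rindex("1"))
        widths.append(2 ** active)
    return widths


G_original_order = ["1101000", "0100110", "0011100", "0001011"]
G_permuted_order = ["1110000", "0111100", "0010110", "0001111"]

for rows in (G_original_order, G_permuted_order):
    profile = width_profile(rows)
    print(profile, sum(profile))
```

Under this assumption the permuted ordering gives fewer nodes in total, in line with the observation above.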


Trellises from parity-check matrices 


Another way of viewing the trellis is in terms of the syndrome. The syndrome 
of a vector r is defined to be Hr, where H is the parity-check matrix. A vector 
is only a codeword if its syndrome is zero. As we generate a codeword we can 
describe the current state by the partial syndrome, that is, the product of 
H with the codeword bits thus far generated. Each state in the trellis is a 
partial syndrome at one time coordinate. The starting and ending states are 
both constrained to be the zero syndrome. Each node in a state represents a 
different possible value for the partial syndrome. Since H is an M x N matrix,
where M = N - K, the syndrome is at most an M-bit vector. So we need at
most $2^M$ nodes in each state. We can construct the trellis of a code from its
parity-check matrix by walking from each end, generating two trees of possible 
syndrome sequences. The intersection of these two trees defines the trellis of 
the code. 

[Figure 25.6. Trellises for four subcodes of the (7,4) Hamming code (left column), and the sequence of trellises that are made when constructing the trellis for the (7,4) Hamming code (right column). Each edge in a trellis is labelled by a zero (shown by a square) or a one (shown by a cross).]

[Figure 25.7. Trellises for the permuted (7,4) Hamming code generated from (a) the generator matrix by the method of figure 25.6; (b) the parity-check matrix by the method on page 332. Each edge in a trellis is labelled by a zero (shown by a square) or a one (shown by a cross).]

In the pictures we obtain from this construction, we can let the vertical
coordinate represent the syndrome. Then any horizontal edge is necessarily
associated with a zero bit (since only a non-zero bit changes the syndrome)
and any non-horizontal edge is associated with a one bit. (Thus in this
representation we no longer need to label the edges in the trellis.) Figure 25.7b
shows the trellis corresponding to the parity-check matrix of equation (25.20).
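The syndrome construction can be sketched as below: grow the set of reachable partial syndromes from the left, grow the set of partial syndromes that can still return to zero from the right, and keep their intersection. The data layout and function names are assumptions made for this illustration.

```python
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def add(s, c):
    """Add column c to partial syndrome s, mod 2."""
    return tuple((a + b) % 2 for a, b in zip(s, c))


def syndrome_trellis(H):
    M, N = len(H), len(H[0])
    cols = [tuple(H[m][n] for m in range(M)) for n in range(N)]
    zero = (0,) * M
    # Tree of partial syndromes reachable from the zero syndrome at time 0.
    fwd = [{zero}]
    for c in cols:
        fwd.append(fwd[-1] | {add(s, c) for s in fwd[-1]})
    # Tree of partial syndromes from which zero is reachable at time N.
    bwd = [{zero}]
    for c in reversed(cols):
        bwd.append(bwd[-1] | {add(s, c) for s in bwd[-1]})
    bwd.reverse()
    nodes = [f & b for f, b in zip(fwd, bwd)]
    # A zero bit keeps the syndrome; a one bit adds the corresponding column.
    edges = []
    for n, c in enumerate(cols):
        layer = []
        for s in nodes[n]:
            for bit in (0, 1):
                t = add(s, c) if bit else s
                if t in nodes[n + 1]:
                    layer.append((s, t, bit))
        edges.append(layer)
    return nodes, edges


def count_paths(nodes, edges):
    """Number of left-to-right paths, i.e. codewords, in the trellis."""
    counts = {s: 1 for s in nodes[0]}
    for layer in edges:
        new = {}
        for s, t, _bit in layer:
            new[t] = new.get(t, 0) + counts.get(s, 0)
        counts = new
    return sum(counts.values())


nodes, edges = syndrome_trellis(H)
print([len(s) for s in nodes], count_paths(nodes, edges))
```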


> 25.5 Solutions 


Table 25.8. The posterior probability over codewords for exercise 25.4.

    t         Likelihood   Posterior probability
    0000000   0.026        0.3006
    0001011   0.00041      0.0047
    0010111   0.0037       0.0423
    0011100   0.015        0.1691
    0100110   0.00041      0.0047
    0101101   0.00010      0.0012
    0110001   0.015        0.1691
    0111010   0.0037       0.0423
    1000101   0.00041      0.0047
    1001110   0.00010      0.0012
    1010010   0.015        0.1691
    1011001   0.0037       0.0423
    1100011   0.00010      0.0012
    1101000   0.00041      0.0047
    1110100   0.0037       0.0423
    1111111   0.000058     0.0007


Solution to exercise 25.4 (p.329). The posterior probability over codewords is 
shown in table 25.8. The most probable codeword is 0000000. The marginal 
posterior probabilities of all seven bits are: 





    n   Likelihood                         Posterior marginals
        P(y_n | t_n=1)   P(y_n | t_n=0)    P(t_n=1 | y)   P(t_n=0 | y)
    1   0.2              0.8               0.266          0.734
    2   0.2              0.8               0.266          0.734
    3   0.9              0.1               0.677          0.323
    4   0.2              0.8               0.266          0.734
    5   0.2              0.8               0.266          0.734
    6   0.2              0.8               0.266          0.734
    7   0.2              0.8               0.266          0.734


So the bitwise decoding is 0010000, which is not actually a codeword. 


Solution to exercise 25.9 (p.330). The MAP codeword is 101, and its
likelihood is 1/8. The normalizing constant of the sum-product algorithm is
$Z = \alpha_I = 3/16$. The intermediate $\alpha_i$ are (from left to right) 1/2, 1/4, 5/16, 4/16;
the intermediate $\beta_i$ are (from right to left) 1/2, 1/8, 9/32, 3/16. The bitwise
decoding is: $P(t_1 = 1 \mid \mathbf{y}) = 3/4$; $P(t_2 = 1 \mid \mathbf{y}) = 1/4$; $P(t_3 = 1 \mid \mathbf{y}) = 5/6$. The
codewords’ probabilities are 1/12, 2/12, 1/12, 8/12 for 000, 011, 110, 101.
Exact Marginalization in Graphs


We now take a more general view of the tasks of inference and marginalization. 
Before reading this chapter, you should read about message passing in Chapter 
16. 

> 26.1 The general problem


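Assume that a function $P^*$ of a set of $N$ variables $\mathbf{x} \equiv \{x_n\}_{n=1}^{N}$ is defined as
a product of $M$ factors as follows:

$$P^*(\mathbf{x}) = \prod_{m=1}^{M} f_m(\mathbf{x}_m). \tag{26.1}$$

Each of the factors $f_m(\mathbf{x}_m)$ is a function of a subset $\mathbf{x}_m$ of the variables that
make up $\mathbf{x}$. If $P^*$ is a positive function then we may be interested in a second
normalized function,

$$P(\mathbf{x}) \equiv \frac{1}{Z} P^*(\mathbf{x}) = \frac{1}{Z} \prod_{m=1}^{M} f_m(\mathbf{x}_m), \tag{26.2}$$

where the normalizing constant $Z$ is defined by

$$Z = \sum_{\mathbf{x}} \prod_{m=1}^{M} f_m(\mathbf{x}_m). \tag{26.3}$$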


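As an example of the notation we’ve just introduced, here’s a function of
three binary variables $x_1, x_2, x_3$ defined by the five factors:

$$\begin{aligned}
f_1(x_1) &= \begin{cases} 0.1 & x_1 = 0 \\ 0.9 & x_1 = 1 \end{cases} \\
f_2(x_2) &= \begin{cases} 0.1 & x_2 = 0 \\ 0.9 & x_2 = 1 \end{cases} \\
f_3(x_3) &= \begin{cases} 0.9 & x_3 = 0 \\ 0.1 & x_3 = 1 \end{cases} \\
f_4(x_1, x_2) &= \begin{cases} 1 & (x_1, x_2) = (0,0) \text{ or } (1,1) \\ 0 & (x_1, x_2) = (1,0) \text{ or } (0,1) \end{cases} \\
f_5(x_2, x_3) &= \begin{cases} 1 & (x_2, x_3) = (0,0) \text{ or } (1,1) \\ 0 & (x_2, x_3) = (1,0) \text{ or } (0,1) \end{cases} \\
P^*(\mathbf{x}) &= f_1(x_1) f_2(x_2) f_3(x_3) f_4(x_1, x_2) f_5(x_2, x_3) \\
P(\mathbf{x}) &= \frac{1}{Z} f_1(x_1) f_2(x_2) f_3(x_3) f_4(x_1, x_2) f_5(x_2, x_3).
\end{aligned} \tag{26.4}$$

The five subsets of $\{x_1, x_2, x_3\}$ denoted by $\mathbf{x}_m$ in the general function (26.1)
are here $\mathbf{x}_1 = \{x_1\}$, $\mathbf{x}_2 = \{x_2\}$, $\mathbf{x}_3 = \{x_3\}$, $\mathbf{x}_4 = \{x_1, x_2\}$, and $\mathbf{x}_5 = \{x_2, x_3\}$.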


The function $P(\mathbf{x})$, by the way, may be recognized as the posterior
probability distribution of the three transmitted bits in a repetition code (section
1.2) when the received signal is r = (1,1,0) and the channel is a binary
symmetric channel with flip probability 0.1. The factors $f_4$ and $f_5$ respectively
enforce the constraints that $x_1$ and $x_2$ must be identical and that $x_2$ and $x_3$
must be identical. The factors $f_1$, $f_2$, $f_3$ are the likelihood functions
contributed by each component of r.

A function of the factored form (26.1) can be depicted by a factor graph, in
which the variables are depicted by circular nodes and the factors are depicted
by square nodes. An edge is put between variable node $n$ and factor node $m$
if the function $f_m(\mathbf{x}_m)$ has any dependence on variable $x_n$. The factor graph
for the example function (26.4) is shown in figure 26.1.


The normalization problem 


The first task to be solved is to compute the normalizing constant Z. 


The marginalization problems 


The second task to be solved is to compute the marginal function of any
variable $x_n$, defined by

$$Z_n(x_n) = \sum_{\{x_{n'}\},\, n' \neq n} P^*(\mathbf{x}). \tag{26.5}$$

For example, if $f$ is a function of three variables then the marginal for
$n = 1$ is defined by

$$Z_1(x_1) = \sum_{x_2, x_3} f(x_1, x_2, x_3). \tag{26.6}$$

This type of summation, over ‘all the $x_{n'}$ except for $x_n$’, is so important that it
can be useful to have a special notation for it: the ‘not-sum’ or ‘summary’.

The third task to be solved is to compute the normalized marginal of any
variable $x_n$, defined by

$$P_n(x_n) \equiv \sum_{\{x_{n'}\},\, n' \neq n} P(\mathbf{x}). \tag{26.7}$$

[We include the suffix ‘n’ in $P_n(x_n)$, departing from our normal practice in the
rest of the book, where we would omit it.]

Exercise 26.1. Show that the normalized marginal is related to the marginal
$Z_n(x_n)$ by

$$P_n(x_n) = \frac{Z_n(x_n)}{Z}. \tag{26.8}$$

We might also be interested in marginals over a subset of the variables,
such as

$$Z_{12}(x_1, x_2) \equiv \sum_{x_3} P^*(x_1, x_2, x_3). \tag{26.9}$$


All these tasks are intractable in general. Even if every factor is a function 
of only three variables, the cost of computing exact solutions for Z and for 
the marginals is believed in general to grow exponentially with the number of 
variables N. 
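For the three-variable example, the normalization and marginalization tasks can still be done by direct enumeration, which gives a reference answer for the message-passing algorithms that follow. The sketch below takes its factor values from the repetition-code reading of the example (received signal (1,1,0), flip probability 0.1); the function names are hypothetical.

```python
from itertools import product

f1 = {0: 0.1, 1: 0.9}  # likelihood factor for x1 (received bit 1)
f2 = {0: 0.1, 1: 0.9}  # likelihood factor for x2 (received bit 1)
f3 = {0: 0.9, 1: 0.1}  # likelihood factor for x3 (received bit 0)


def equal(a, b):
    """Constraint factor: 1 if the two bits agree, else 0."""
    return 1.0 if a == b else 0.0


def p_star(x1, x2, x3):
    """Unnormalized function P*(x) as a product of the five factors."""
    return f1[x1] * f2[x2] * f3[x3] * equal(x1, x2) * equal(x2, x3)


states = list(product((0, 1), repeat=3))

# Normalizing constant: sum over all 2^N settings of x.
Z = sum(p_star(*x) for x in states)


def marginal(n):
    """Z_n(x_n): sum P* over all variables except x_n (n is 0-based here)."""
    return [sum(p_star(*x) for x in states if x[n] == v) for v in (0, 1)]


def normalized_marginal(n):
    return [z / Z for z in marginal(n)]


print(Z, marginal(0), normalized_marginal(0))
```

The loop over `states` is the source of the exponential cost: it has $2^N$ entries for $N$ binary variables.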

For certain functions $P^*$, however, the marginals can be computed efficiently
by exploiting the factorization of $P^*$. The idea of how this efficiency
arises is well illustrated by the message-passing examples of Chapter 16. The
sum-product algorithm that we now review is a generalization of message-passing
rule-set B (p.242). As was the case there, the sum-product algorithm
is only valid if the graph is tree-like.

[Figure 26.1. The factor graph associated with the function $P^*(\mathbf{x})$ (26.4).]


> 26.2 The sum-product algorithm


Notation 


We identify the set of variables that the mth factor depends on, $\mathbf{x}_m$, by the set
of their indices $\mathcal{N}(m)$. For our example function (26.4), the sets are $\mathcal{N}(1) = \{1\}$
(since $f_1$ is a function of $x_1$ alone), $\mathcal{N}(2) = \{2\}$, $\mathcal{N}(3) = \{3\}$, $\mathcal{N}(4) = \{1, 2\}$,
and $\mathcal{N}(5) = \{2, 3\}$. Similarly we define the set of factors in which
variable $n$ participates, by $\mathcal{M}(n)$. We denote a set $\mathcal{N}(m)$ with variable $n$
excluded by $\mathcal{N}(m) \backslash n$. We introduce the shorthand $\mathbf{x}_m \backslash n$ or $\mathbf{x}_{m \backslash n}$ to denote
the set of variables in $\mathbf{x}_m$ with $x_n$ excluded, i.e.,

$$\mathbf{x}_m \backslash n \equiv \{x_{n'} : n' \in \mathcal{N}(m) \backslash n\}. \tag{26.10}$$

The sum-product algorithm will involve messages of two types passing
along the edges in the factor graph: messages $q_{n \to m}$ from variable nodes to
factor nodes, and messages $r_{m \to n}$ from factor nodes to variable nodes. A
message (of either type, $q$ or $r$) that is sent along an edge connecting factor
$f_m$ to variable $x_n$ is always a function of the variable $x_n$.

Here are the two rules for the updating of the two sets of messages. 


From variable to factor:

$$q_{n \to m}(x_n) = \prod_{m' \in \mathcal{M}(n) \backslash m} r_{m' \to n}(x_n). \tag{26.11}$$

From factor to variable:

$$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \backslash n} \left( f_m(\mathbf{x}_m) \prod_{n' \in \mathcal{N}(m) \backslash n} q_{n' \to m}(x_{n'}) \right). \tag{26.12}$$


How these rules apply to leaves in the factor graph


A node that has only one edge connecting it to another node is called a leaf
node.

Some factor nodes in the graph may be connected to only one variable
node, in which case the set $\mathcal{N}(m) \backslash n$ of variables appearing in the factor
message update (26.12) is an empty set, and the product of functions
$\prod_{n' \in \mathcal{N}(m) \backslash n} q_{n' \to m}(x_{n'})$ is the empty product, whose value is 1. Such a
factor node therefore always broadcasts to its one neighbour $x_n$ the message
$r_{m \to n}(x_n) = f_m(x_n)$.

Similarly, there may be variable nodes that are connected to only one
factor node, so the set $\mathcal{M}(n) \backslash m$ in (26.11) is empty. These nodes perpetually
broadcast the message $q_{n \to m}(x_n) = 1$.

[Figure 26.2. A factor node that is a leaf node perpetually sends the message $r_{m \to n}(x_n) = f_m(x_n)$ to its one neighbour $x_n$.]

[Figure 26.3. A variable node that is a leaf node perpetually sends the message $q_{n \to m}(x_n) = 1$.]


Starting and finishing, method 1 


The algorithm can be initialized in two ways. If the graph is tree-like then
it must have nodes that are leaves. These leaf nodes can broadcast their
messages to their respective neighbours from the start.

$$\text{For all leaf variable nodes } n: \quad q_{n \to m}(x_n) = 1 \tag{26.13}$$

$$\text{For all leaf factor nodes } m: \quad r_{m \to n}(x_n) = f_m(x_n). \tag{26.14}$$

We can then adopt the procedure used in Chapter 16’s message-passing
rule-set B (p.242): a message is created in accordance with the rules (26.11, 26.12)
only if all the messages on which it depends are present. For example, in
figure 26.4, the message from $x_1$ to $f_1$ will be sent only when the message
from $f_4$ to $x_1$ has been received; and the message from $x_2$ to $f_2$, $q_{2 \to 2}$, can be
sent only when the messages $r_{4 \to 2}$ and $r_{5 \to 2}$ have both been received.

Messages will thus flow through the tree, one in each direction along every
edge, and after a number of steps equal to the diameter of the graph, every
message will have been created.

[Figure 26.4. Our model factor graph for the function $P^*(\mathbf{x})$ (26.4).]

The answers we require can then be read out. The marginal function of
$x_n$ is obtained by multiplying all the incoming messages at that node.

$$Z_n(x_n) = \prod_{m \in \mathcal{M}(n)} r_{m \to n}(x_n). \tag{26.15}$$

The normalizing constant $Z$ can be obtained by summing any marginal
function, $Z = \sum_{x_n} Z_n(x_n)$, and the normalized marginals obtained from

$$P_n(x_n) = \frac{Z_n(x_n)}{Z}. \tag{26.16}$$
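The schedule of method 1 (create a message only when everything it depends on is present) can be sketched as follows for the example function (26.4). This is an illustrative sketch for binary variables: the dictionary layout of `factors`, the variable and factor names, and the function name are assumptions.

```python
from itertools import product

VALUES = (0, 1)

# Each factor: (tuple of variable names, function of those variables).
factors = {
    "f1": (("x1",), lambda a: (0.1, 0.9)[a]),
    "f2": (("x2",), lambda a: (0.1, 0.9)[a]),
    "f3": (("x3",), lambda a: (0.9, 0.1)[a]),
    "f4": (("x1", "x2"), lambda a, b: float(a == b)),
    "f5": (("x2", "x3"), lambda a, b: float(a == b)),
}


def sum_product_tree(factors):
    edges = [(m, n) for m, (scope, _) in factors.items() for n in scope]
    M = {}  # M[n]: factors in which variable n participates
    for m, n in edges:
        M.setdefault(n, []).append(m)
    q, r = {}, {}  # q[(n, m)] and r[(m, n)] are lists indexed by x_n
    while len(q) + len(r) < 2 * len(edges):
        created = 0
        for m, n in edges:
            scope, f = factors[m]
            # Variable-to-factor message: product of the other incoming r's.
            other_factors = [m2 for m2 in M[n] if m2 != m]
            if (n, m) not in q and all((m2, n) in r for m2 in other_factors):
                msg = [1.0 for _ in VALUES]
                for m2 in other_factors:
                    msg = [a * b for a, b in zip(msg, r[(m2, n)])]
                q[(n, m)] = msg
                created += 1
            # Factor-to-variable message: sum over the other variables.
            other_vars = [n2 for n2 in scope if n2 != n]
            if (m, n) not in r and all((n2, m) in q for n2 in other_vars):
                msg = [0.0 for _ in VALUES]
                for assignment in product(VALUES, repeat=len(scope)):
                    x = dict(zip(scope, assignment))
                    term = f(*assignment)
                    for n2 in other_vars:
                        term *= q[(n2, m)][x[n2]]
                    msg[x[n]] += term
                r[(m, n)] = msg
                created += 1
        if created == 0:
            raise ValueError("no message can be created: graph is not tree-like")
    # Read out the marginal functions Z_n(x_n).
    Zn = {}
    for n, ms in M.items():
        z = [1.0 for _ in VALUES]
        for m in ms:
            z = [a * b for a, b in zip(z, r[(m, n)])]
        Zn[n] = z
    return q, r, Zn


q, r, Zn = sum_product_tree(factors)
Z = sum(Zn["x1"])
print(Zn, Z)
```

Leaf factors need no incoming messages, so they fire on the first sweep; the loop ends once one message has been created in each direction along every edge.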





> Exercise 26.2. Apply the sum-product algorithm to the function defined in
equation (26.4) and figure 26.1. Check that the normalized marginals
are consistent with what you know about the repetition code R3.


Exercise 26.3. Prove that the sum-product algorithm correctly computes
the marginal functions $Z_n(x_n)$ if the graph is tree-like.


Exercise 26.4. Describe how to use the messages computed by the
sum-product algorithm to obtain more complicated marginal functions in a
tree-like graph, for example $Z_{1,2}(x_1, x_2)$, for two variables $x_1$ and $x_2$ that
are connected to one common factor node.


Starting and finishing, method 2 


Alternatively, the algorithm can be initialized by setting all the initial
messages from variables to 1:

$$\text{for all } n, m: \quad q_{n \to m}(x_n) = 1, \tag{26.17}$$

then proceeding with the factor message update rule (26.12), alternating with
the variable message update rule (26.11). Compared with method 1, this lazy
initialization method leads to a load of wasted computations, whose results
are gradually flushed out by the correct answers computed by method 1.

After a number of iterations equal to the diameter of the factor graph,
the algorithm will converge to a set of messages satisfying the sum-product
relationships (26.11, 26.12).
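The lazy schedule can be sketched as follows, again for the example function (26.4). The code is illustrative only: the layout of `factors` and the function name are assumptions, and the optional `normalize` flag rescales each variable-to-factor message to sum to one, which changes the marginal functions only by a constant factor.

```python
from itertools import product

VALUES = (0, 1)

factors = {
    "f1": (("x1",), lambda a: (0.1, 0.9)[a]),
    "f2": (("x2",), lambda a: (0.1, 0.9)[a]),
    "f3": (("x3",), lambda a: (0.9, 0.1)[a]),
    "f4": (("x1", "x2"), lambda a, b: float(a == b)),
    "f5": (("x2", "x3"), lambda a, b: float(a == b)),
}


def run_flooding(factors, iterations, normalize=False):
    edges = [(m, n) for m, (scope, _) in factors.items() for n in scope]
    M = {}
    for m, n in edges:
        M.setdefault(n, []).append(m)
    # Lazy initialization: every variable-to-factor message starts at 1.
    q = {(n, m): [1.0, 1.0] for m, n in edges}
    r = {}
    for _ in range(iterations):
        # Update all factor-to-variable messages from the current q's.
        for m, n in edges:
            scope, f = factors[m]
            msg = [0.0, 0.0]
            for assignment in product(VALUES, repeat=len(scope)):
                x = dict(zip(scope, assignment))
                term = f(*assignment)
                for n2 in scope:
                    if n2 != n:
                        term *= q[(n2, m)][x[n2]]
                msg[x[n]] += term
            r[(m, n)] = msg
        # Update all variable-to-factor messages from the new r's.
        for m, n in edges:
            msg = [1.0, 1.0]
            for m2 in M[n]:
                if m2 != m:
                    msg = [a * b for a, b in zip(msg, r[(m2, n)])]
            if normalize:
                total = sum(msg)
                msg = [a / total for a in msg]
            q[(n, m)] = msg
    # Product of incoming factor messages at each variable.
    marginals = {}
    for n, ms in M.items():
        z = [1.0, 1.0]
        for m in ms:
            z = [a * b for a, b in zip(z, r[(m, n)])]
        marginals[n] = z
    return marginals


early = run_flooding(factors, 2)
settled = run_flooding(factors, 6)
scaled = run_flooding(factors, 6, normalize=True)
print(early["x1"], settled["x1"], scaled["x1"])
```

On this small tree the messages stop changing after a few sweeps; six sweeps (the diameter of the factor graph) is used here as a safe choice.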


Exercise 26.5. Apply this second version of the sum-product algorithm to
the function defined in equation (26.4) and figure 26.1.

The reason for introducing this lazy method is that (unlike method 1) it can
be applied to graphs that are not tree-like. When the sum-product algorithm
is run on a graph with cycles, the algorithm does not necessarily converge,
and certainly does not in general compute the correct marginal functions; but
it is nevertheless an algorithm of great practical importance, especially in the
decoding of sparse-graph codes.


Sum-product algorithm with on-the-fly normalization


If we are interested in only the normalized marginals, then another version
of the sum-product algorithm may be useful. The factor-to-variable messages
$r_{m \to n}$ are computed in just the same way (26.12), but the variable-to-factor
messages are normalized thus:

$$q_{n \to m}(x_n) = \alpha_{nm} \prod_{m' \in \mathcal{M}(n) \backslash m} r_{m' \to n}(x_n) \tag{26.18}$$

where $\alpha_{nm}$ is a scalar chosen such that

$$\sum_{x_n} q_{n \to m}(x_n) = 1. \tag{26.19}$$

Exercise 26.6. Apply this normalized version of the sum-product algorithm
to the function defined in equation (26.4) and figure 26.1.


A factorization view of the sum—product algorithm 


One way to view the sum-product algorithm is that it reexpresses the original
factored function, the product of $M$ factors $P^*(\mathbf{x}) = \prod_{m=1}^{M} f_m(\mathbf{x}_m)$, as another
factored function which is the product of $M + N$ factors,

$$P^*(\mathbf{x}) = \prod_{m=1}^{M} \phi_m(\mathbf{x}_m) \prod_{n=1}^{N} \psi_n(x_n). \tag{26.20}$$

Each factor $\phi_m$ is associated with a factor node $m$, and each factor $\psi_n(x_n)$ is
associated with a variable node. Initially $\phi_m(\mathbf{x}_m) = f_m(\mathbf{x}_m)$ and $\psi_n(x_n) = 1$.
Each time a factor-to-variable message $r_{m \to n}(x_n)$ is sent, the factorization
is updated thus:

$$\psi_n(x_n) = \prod_{m \in \mathcal{M}(n)} r_{m \to n}(x_n) \tag{26.21}$$

$$\phi_m(\mathbf{x}_m) = \frac{f(\mathbf{x}_m)}{\prod_{n \in \mathcal{N}(m)} r_{m \to n}(x_n)}. \tag{26.22}$$

And each message can be computed in terms of $\phi$ and $\psi$ using

$$r_{m \to n}(x_n) = \sum_{\mathbf{x}_m \backslash n} \left( \phi_m(\mathbf{x}_m) \prod_{n' \in \mathcal{N}(m)} \psi_{n'}(x_{n'}) \right) \tag{26.23}$$

which differs from the assignment (26.12) in that the product is over all
$n' \in \mathcal{N}(m)$.

Exercise 26.7. Confirm that the update rules (26.21-26.23) are equivalent
to the sum-product rules (26.11-26.12). So $\psi_n(x_n)$ eventually becomes
the marginal $Z_n(x_n)$.

This factorization viewpoint applies whether or not the graph is tree-like.
Computational tricks


On-the-fly normalization is a good idea from a computational point of view 
because if P* is a product of many factors, its values are likely to be very large 
or very small. 

Another useful computational trick involves passing the logarithms of the 
messages q and r instead of q and r themselves; the computations of the 
products in the algorithm (26.11, 26.12) are then replaced by simpler additions. 
The summations in (26.12) of course become more difficult: to carry them out 
and return the logarithm, we need to compute softmax functions like 


l= ln(e" + e? + e’). (26.24) 


But this computation can be done efficiently using look-up tables along with
the observation that the value of the answer l is typically just a little larger
than max_i l_i. If we store in look-up tables values of the function

$$ \ln(1 + e^{\delta}) \tag{26.25} $$

(for negative δ) then l can be computed exactly in a number of look-ups and
additions scaling as the number of terms in the sum. If look-ups and sorting
operations are cheaper than exp() then this approach costs less than the
direct evaluation (26.24). The number of operations can be further reduced
by omitting negligible contributions from the smallest of the {l_i}.
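
The log-domain summation can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the function names log_add
and log_sum_exp are my own, and the call math.log1p(math.exp(delta)) stands
in for the look-up table of ln(1 + e^δ) described above. Each pairwise step
subtracts the larger term first, so δ is never positive and nothing
overflows.

```python
import math

def log_add(la, lb):
    """Return ln(e^la + e^lb), evaluating ln(1 + e^delta) only for delta <= 0."""
    hi, lo = (la, lb) if la >= lb else (lb, la)
    if lo == -math.inf:          # e^lo = 0 contributes nothing
        return hi
    # hi + ln(1 + e^(lo - hi)); a look-up table could replace this call
    return hi + math.log1p(math.exp(lo - hi))

def log_sum_exp(ls):
    """Return ln(sum_i e^{l_i}) by folding the terms in one at a time."""
    total = -math.inf            # ln(0): the empty sum
    for l in ls:
        total = log_add(total, l)
    return total

# Log messages this negative would underflow to zero if exponentiated directly.
print(log_sum_exp([-1000.0, -1001.0, -1002.0]))
```

The number of calls to log_add equals the number of terms, in line with the
scaling claimed above; omitting terms far below the running maximum would be
the further saving mentioned in the text.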

A third computational trick applicable to certain error-correcting codes is 
to pass not the messages but the Fourier transforms of the messages. This 
again makes the computations of the factor-to-variable messages quicker. A 
simple example of this Fourier transform trick is given in Chapter 47 at equa- 
tion (47.9). 


> 26.3 The min-sum algorithm 


The sum—product algorithm solves the problem of finding the marginal func- 
tion of a given product P*(x). This is analogous to solving the bitwise decod- 
ing problem of section 25.1. And just as there were other decoding problems 
(for example, the codeword decoding problem), we can define other tasks 
involving P*(x) that can be solved by modifications of the sum—product algo- 
rithm. For example, consider this task, analogous to the codeword decoding 
problem: 


The maximization problem. Find the setting of x that maximizes the 
product P*(x). 


This problem can be solved by replacing the two operations add and multiply
everywhere they appear in the sum-product algorithm by another pair
of operations that satisfy the distributive law, namely max and multiply. If
we replace summation (+, Σ) by maximization, we notice that the quantity
formerly known as the normalizing constant,

$$ Z = \sum_{\mathbf{x}} P^*(\mathbf{x}), \tag{26.26} $$

becomes max_x P*(x).

Thus the sum-product algorithm can be turned into a max-product algorithm
that computes max_x P*(x), and from which the solution of the
maximization problem can be deduced. Each ‘marginal’ Z_n(x_n) then lists the
maximum value that P*(x) can attain for each value of x_n.

In practice, the max-product algorithm is most often carried out in the
negative log likelihood domain, where max and product become min and sum.
The min-sum algorithm is also known as the Viterbi algorithm.
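
As an illustration, here is a minimal min-sum sketch for the simplest
tree-like graph, a chain of variables with one-variable and neighbouring-pair
factors. The representation is an assumption of mine, not taken from the
text: unary[n][k] and pairwise[n][j][k] hold costs (negative log factors),
and the function name min_sum_chain is hypothetical. The forward pass
replaces sum and product by min and sum; back-pointers then recover the
minimizing setting of x.

```python
def min_sum_chain(unary, pairwise):
    """Minimize sum_n unary[n][x_n] + sum_n pairwise[n][x_n][x_{n+1}].

    unary[n][k]       cost (negative log factor) of x_n = k
    pairwise[n][j][k] cost of the pair (x_n = j, x_{n+1} = k)
    Returns (minimum total cost, minimizing assignment as a list).
    """
    msg = list(unary[0])   # msg[k]: least cost of any prefix ending in state k
    back = []              # back[n][k]: best state of x_n given x_{n+1} = k
    for n in range(1, len(unary)):
        new_msg, ptr = [], []
        for k in range(len(unary[n])):
            cands = [msg[j] + pairwise[n - 1][j][k] for j in range(len(msg))]
            j_best = min(range(len(cands)), key=cands.__getitem__)
            new_msg.append(cands[j_best] + unary[n][k])
            ptr.append(j_best)
        msg = new_msg
        back.append(ptr)
    k = min(range(len(msg)), key=msg.__getitem__)
    best_cost, path = msg[k], [k]
    for ptr in reversed(back):   # follow the back-pointers to the start
        k = ptr[k]
        path.append(k)
    path.reverse()
    return best_cost, path

# Three binary variables; neighbouring variables pay a cost of 1 to disagree.
unary = [[0.5, 1.2], [0.3, 0.9], [1.0, 0.2]]
disagree = [[0.0, 1.0], [1.0, 0.0]]
cost, x_best = min_sum_chain(unary, [disagree, disagree])
print(cost, x_best)
```

The work grows linearly with the length of the chain, whereas enumerating
every setting of x grows exponentially.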


> 26.4 The junction tree algorithm 


What should one do when the factor graph one is interested in is not a tree? 

There are several options, and they divide into exact methods and approx- 
imate methods. The most widely used exact method for handling marginaliza- 
tion on graphs with cycles is called the junction tree algorithm. This algorithm 
works by agglomerating variables together until the agglomerated graph has 
no cycles. You can probably figure out the details for yourself; the complexity 
of the marginalization grows exponentially with the number of agglomerated 
variables. Read more about the junction tree algorithm in (Lauritzen, 1996; 
Jordan, 1998). 

There are many approximate methods, and we'll visit some of them over 
the next few chapters — Monte Carlo methods and variational methods, to 
name a couple. However, the most amusing way of handling factor graphs 
to which the sum—product algorithm may not be applied is, as we already 
mentioned, to apply the sum—product algorithm! We simply compute the 
messages for each node in the graph, as if the graph were a tree, iterate, and 
cross our fingers. This so-called ‘loopy’ message passing has great importance 
in the decoding of error-correcting codes, and we’ll come back to it in section 
33.8 and Part VI. 


Further reading 


For further reading about factor graphs and the sum—product algorithm, see 
Kschischang et al. (2001), Yedidia et al. (2000), Yedidia et al. (2001a), Yedidia 
et al. (2002), Wainwright et al. (2003), and Forney (2001). 

See also Pearl (1988). A good reference for the fundamental theory of 
graphical models is Lauritzen (1996). A readable introduction to Bayesian 
networks is given by Jensen (1996). 

Interesting message-passing algorithms that have different capabilities from 
the sum—product algorithm include expectation propagation (Minka, 2001) 
and survey propagation (Braunstein et al., 2003). See also section 33.8. 


»> 26.5 Exercises


> Exercise 26.8.[2] Express the joint probability distribution from the burglar
alarm and earthquake problem (example 21.1 (p.293)) as a factor graph,
and find the marginal probabilities of all the variables as each piece of
information comes to Fred’s attention, using the sum-product algorithm
with on-the-fly normalization.



Laplace’s Method


The idea behind the Laplace approximation is simple. We assume that an
unnormalized probability density P*(x), whose normalizing constant

$$ Z_P \equiv \int P^*(x) \, dx \tag{27.1} $$

is of interest, has a peak at a point x_0. We Taylor-expand the logarithm of
P*(x) around this peak:

$$ \ln P^*(x) \simeq \ln P^*(x_0) - \frac{c}{2} (x - x_0)^2 + \cdots, \tag{27.2} $$

where

$$ c = - \left. \frac{\partial^2}{\partial x^2} \ln P^*(x) \right|_{x = x_0}. \tag{27.3} $$

We then approximate P*(x) by an unnormalized Gaussian,

$$ Q^*(x) \equiv P^*(x_0) \exp\left[ -\frac{c}{2} (x - x_0)^2 \right], \tag{27.4} $$

and we approximate the normalizing constant Z_P by the normalizing constant
of this Gaussian,

$$ Z_Q = P^*(x_0) \sqrt{\frac{2\pi}{c}}. \tag{27.5} $$

We can generalize this integral to approximate Z_P for a density P*(x) over
a K-dimensional space x. If the matrix of second derivatives of -ln P*(x) at
the maximum x_0 is A, defined by:

$$ A_{ij} = - \left. \frac{\partial^2}{\partial x_i \partial x_j} \ln P^*(\mathbf{x}) \right|_{\mathbf{x} = \mathbf{x}_0}, \tag{27.6} $$

so that the expansion (27.2) is generalized to

$$ \ln P^*(\mathbf{x}) \simeq \ln P^*(\mathbf{x}_0) - \frac{1}{2} (\mathbf{x} - \mathbf{x}_0)^{\mathsf T} \mathbf{A} (\mathbf{x} - \mathbf{x}_0) + \cdots, \tag{27.7} $$

then the normalizing constant can be approximated by:

$$ Z_P \simeq Z_Q = P^*(\mathbf{x}_0) \frac{1}{\sqrt{\det \frac{1}{2\pi} \mathbf{A}}} = P^*(\mathbf{x}_0) \sqrt{\frac{(2\pi)^K}{\det \mathbf{A}}}. \tag{27.8} $$

Predictions can be made using the approximation Q. Physicists also call this
widely-used approximation the saddle-point approximation.
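
To see the one-dimensional recipe at work, here is a small sketch. The test
density P*(x) = x^a e^{-x} with a = 10 is my own choice, picked because its
exact normalizing constant is known to be Γ(a + 1); the helper name
laplace_log_Z and the central-difference estimate of the curvature c are
likewise assumptions made for illustration. With this density the Laplace
approximation reproduces Stirling's approximation to the factorial.

```python
import math

def laplace_log_Z(log_p_star, x0, h=1e-3):
    """ln Z_Q = ln P*(x0) + (1/2) ln(2 pi / c), following (27.3) and (27.5).

    The curvature c = -d^2/dx^2 ln P*(x) at x0 is estimated by a central
    difference with step h (an illustrative choice, not a tuned one).
    """
    c = -(log_p_star(x0 + h) - 2.0 * log_p_star(x0) + log_p_star(x0 - h)) / h**2
    return log_p_star(x0) + 0.5 * math.log(2.0 * math.pi / c)

a = 10.0

def log_p_star(x):
    """ln P*(x) for the assumed density P*(x) = x^a e^{-x}, x > 0."""
    return a * math.log(x) - x

x0 = a                               # the peak: d/dx (a ln x - x) = 0 at x = a
log_Z_Q = laplace_log_Z(log_p_star, x0)
log_Z_P = math.lgamma(a + 1.0)       # exact: Z_P = Gamma(a + 1)
print(log_Z_P - log_Z_Q)             # error of the approximation, in nats
```

The peak location is supplied by hand here; in a harder problem it would
come from a numerical optimizer.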
The fact that the normalizing constant of a Gaussian is given by

$$ \int d^K\mathbf{x} \, \exp\left[ -\frac{1}{2} \mathbf{x}^{\mathsf T} \mathbf{A} \mathbf{x} \right] = \sqrt{\frac{(2\pi)^K}{\det \mathbf{A}}} \tag{27.9} $$

can be proved by making an orthogonal transformation into the basis u in which
A is transformed into a diagonal matrix. The integral then separates into a
product of one-dimensional integrals, each of the form

$$ \int du_i \, \exp\left[ -\frac{1}{2} \lambda_i u_i^2 \right] = \sqrt{\frac{2\pi}{\lambda_i}}. \tag{27.10} $$

The product of the eigenvalues λ_i is the determinant of A.


The Laplace approximation is basis-dependent: if x is transformed to a
nonlinear function u(x) and the density is transformed to P(u) = P(x) |dx/du|
then in general the approximate normalizing constants Z_Q will be different.
This can be viewed as a defect (since the true value Z_P is basis-independent)
or an opportunity (because we can hunt for a choice of basis in which the
Laplace approximation is most accurate).
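
A small numerical sketch makes the basis dependence concrete. The density
P*(x) = x^a e^{-x} with a = 10 is an assumed example of mine (its exact
normalizing constant is Γ(a + 1)), and the peak positions and curvatures
below are worked out by hand for that density. The same density is
approximated once in the basis x and once in the basis u = ln x.

```python
import math

a = 10.0                         # assumed exponent in P*(x) = x^a e^{-x}, x > 0
log_Z_P = math.lgamma(a + 1.0)   # exact ln Z_P, since Z_P = Gamma(a + 1)

# Basis x: ln P*(x) = a ln x - x peaks at x0 = a, with curvature c = a/x0^2 = 1/a.
x0, c_x = a, 1.0 / a
log_Z_Q_x = a * math.log(x0) - x0 + 0.5 * math.log(2.0 * math.pi / c_x)

# Basis u = ln x: P*(u) = P*(x) |dx/du| = exp((a + 1) u - e^u),
# which peaks at e^{u0} = a + 1, with curvature c = e^{u0} = a + 1.
u0, c_u = math.log(a + 1.0), a + 1.0
log_Z_Q_u = (a + 1.0) * u0 - math.exp(u0) + 0.5 * math.log(2.0 * math.pi / c_u)

print(log_Z_P - log_Z_Q_x)       # error in the basis x, in nats
print(log_Z_P - log_Z_Q_u)       # error in the basis u = ln x, in nats
```

For this particular density the two answers differ, and the logarithmic
basis comes slightly closer to the true value; that is a property of this
example, not a general rule.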


»> 27.1 Exercises


> Exercise 27.1.[2] (See also exercise 22.8 (p.307).) A photon counter is pointed
at a remote star for one minute, in order to infer the rate of photons
arriving at the counter per minute, λ. Assuming the number of photons
collected r has a Poisson distribution with mean λ,

$$ P(r \mid \lambda) = \exp(-\lambda) \frac{\lambda^r}{r!}, \tag{27.11} $$

and assuming the improper prior P(λ) = 1/λ, make Laplace approximations
to the posterior distribution

(a) over λ

(b) over log λ. [Note the improper prior transforms to P(log λ) =
constant.]


> Exercise 27.2.[2] Use Laplace’s method to approximate the integral

$$ Z(u_1, u_2) = \int_{-\infty}^{\infty} da \, f(a)^{u_1} (1 - f(a))^{u_2}, \tag{27.12} $$

where f(a) = 1/(1 + e^{-a}) and u_1, u_2 are positive. Check the accuracy of
the approximation against the exact answer (23.29, p.316) for (u_1, u_2) =
(1/2, 1/2) and (u_1, u_2) = (1, 1). Measure the error (log Z_P - log Z_Q) in
bits.


> Exercise 27.3.[3] Linear regression. N datapoints {(x^(n), t^(n))} are generated by
the experimenter choosing each x^(n), then the world delivering a noisy
version of the linear function

$$ y(x) = w_0 + w_1 x, \tag{27.13} $$
$$ t^{(n)} \sim \mathrm{Normal}(y(x^{(n)}), \sigma_\nu^2). \tag{27.14} $$

Assuming Gaussian priors on w_0 and w_1, make the Laplace approximation
to the posterior distribution of w_0 and w_1 (which is exact, in fact)
and obtain the predictive distribution for the next datapoint t^(N+1), given
x^(N+1).


(See MacKay (1992a) for further reading.)
Model Comparison and Occam’s Razor


Figure 28.1. A picture to be interpreted. It contains a tree and some boxes.


»> 28.1 Occam’s razor 





How many boxes are in the picture (figure 28.1)? In particular, how many
boxes are in the vicinity of the tree? If we looked with x-ray spectacles,
would we see one or two boxes behind the trunk (figure 28.2)? (Or even
more?) Occam’s razor is the principle that states a preference for simple
theories. ‘Accept the simplest explanation that fits the data’. Thus according
to Occam’s razor, we should deduce that there is only one box behind the tree.
Is this an ad hoc rule of thumb? Or is there a convincing reason for believing
there is most likely one box? Perhaps your intuition likes the argument ‘well,
it would be a remarkable coincidence for the two boxes to be just the same
height and colour as each other’. If we wish to make artificial intelligences
that interpret data correctly, we must translate this intuitive feeling into a
concrete theory.

Figure 28.2. How many boxes are behind the tree? One, or two?


Motivations for Occam’s razor 


If several explanations are compatible with a set of observations, Occam’s
razor advises us to buy the simplest. This principle is often advocated for one
of two reasons: the first is aesthetic (‘A theory with mathematical beauty is
more likely to be correct than an ugly one that fits some experimental data’
(Paul Dirac)); the second reason is the past empirical success of Occam’s razor.
However there is a different justification for Occam’s razor, namely:


Coherent inference (as embodied by Bayesian probability) auto- 
matically embodies Occam’s razor, quantitatively. 


It is indeed more probable that there’s one box behind the tree, and we can 
compute how much more probable one is than two. 


Model comparison and Occam’s razor


We evaluate the plausibility of two alternative theories H_1 and H_2 in the light
of data D as follows: using Bayes’ theorem, we relate the plausibility of model
H_1 given the data, P(H_1 | D), to the predictions made by the model about
the data, P(D | H_1), and the prior plausibility of H_1, P(H_1). This gives the
following probability ratio between theory H_1 and theory H_2:

$$ \frac{P(\mathcal{H}_1 \mid D)}{P(\mathcal{H}_2 \mid D)} = \frac{P(\mathcal{H}_1)}{P(\mathcal{H}_2)} \, \frac{P(D \mid \mathcal{H}_1)}{P(D \mid \mathcal{H}_2)}. \tag{28.1} $$

The first ratio (P(H_1)/P(H_2)) on the right-hand side measures how much our
initial beliefs favoured H_1 over H_2. The second ratio expresses how well the
observed data were predicted by H_1, compared to H_2.

How does this relate to Occam’s razor, when H_1 is a simpler model than
H_2? The first ratio (P(H_1)/P(H_2)) gives us the opportunity, if we wish, to
insert a prior bias in favour of H_1 on aesthetic grounds, or on the basis of
experience. This would correspond to the aesthetic and empirical motivations
for Occam’s razor mentioned earlier. But such a prior bias is not necessary:
the second ratio, the data-dependent factor, embodies Occam’s razor
automatically. Simple models tend to make precise predictions. Complex models,
by their nature, are capable of making a greater variety of predictions (figure
28.3). So if H_2 is a more complex model, it must spread its predictive
probability P(D | H_2) more thinly over the data space than H_1. Thus, in the case
where the data are compatible with both theories, the simpler H_1 will turn out
more probable than H_2, without our having to express any subjective dislike
for complex models. Our subjective prior just needs to assign equal prior
probabilities to the possibilities of simplicity and complexity. Probability theory
then allows the observed data to express their opinion.

Figure 28.3. Why Bayesian inference embodies Occam’s razor. This figure gives
the basic intuition for why complex models can turn out to be less probable.
The horizontal axis represents the space of possible data sets D. Bayes’
theorem rewards models in proportion to how much they predicted the data that
occurred. These predictions are quantified by a normalized probability
distribution on D. This probability of the data given model H_i, P(D | H_i),
is called the evidence for H_i.
A simple model H_1 makes only a limited range of predictions, shown by
P(D | H_1); a more powerful model H_2, that has, for example, more free
parameters than H_1, is able to predict a greater variety of data sets. This
means, however, that H_2 does not predict the data sets in region C_1 as
strongly as H_1. Suppose that equal prior probabilities have been assigned to
the two models. Then, if the data set falls in region C_1, the less powerful
model H_1 will be the more probable model.


Let us turn to a simple example. Here is a sequence of numbers:

    -1, 3, 7, 11.

The task is to predict the next two numbers, and infer the underlying process
that gave rise to this sequence. A popular answer to this question is the
prediction ‘15, 19’, with the explanation ‘add 4 to the previous number’.
What about the alternative answer ‘-19.9, 1043.8’ with the underlying
rule being: ‘get the next number from the previous number, x, by evaluating
-x^3/11 + 9/11 x^2 + 23/11’? I assume that this prediction seems rather less
plausible. But the second rule fits the data (-1, 3, 7, 11) just as well as the
rule ‘add 4’. So why should we find it less plausible? Let us give labels to the
two general theories:


H_a: the sequence is an arithmetic progression, ‘add n’, where n is an integer.


H_c: the sequence is generated by a cubic function of the form x → cx^3 +
dx^2 + e, where c, d and e are fractions.


One reason for finding the second explanation, H_c, less plausible, might be
that arithmetic progressions are more frequently encountered than cubic
functions. This would put a bias in the prior probability ratio P(H_a)/P(H_c) in
equation (28.1). But let us give the two theories equal prior probabilities, and
concentrate on what the data have to say. How well did each theory predict
the data?

To obtain P(D | H_a) we must specify the probability distribution that each
model assigns to its parameters. First, H_a depends on the added integer n,
and the first number in the sequence. Let us say that these numbers could
each have been anywhere between -50 and 50 (101 equally probable integer
values each). Then since only the pair of values {n = 4, first number = -1}
give rise to the observed data D = (-1, 3, 7, 11), the probability of the
data, given H_a, is:

$$ P(D \mid \mathcal{H}_a) = \frac{1}{101} \, \frac{1}{101} = 0.00010. \tag{28.2} $$


To evaluate P(D | H_c), we must similarly say what values the fractions c, d
and e might take on. [I choose to represent these numbers as fractions rather
than real numbers because if we used real numbers, the model would assign,
relative to H_a, an infinitesimal probability to D. Real parameters are the
norm however, and are assumed in the rest of this chapter.] A reasonable
prior might state that for each fraction the numerator could be any number
between -50 and 50, and the denominator is any number between 1 and 50.
As for the initial value in the sequence, let us leave its probability distribution
the same as in H_a. There are four ways of expressing the fraction c = -1/11 =
-2/22 = -3/33 = -4/44 under this prior, and similarly there are four and two
possible solutions for d and e, respectively. So the probability of the observed
data, given H_c, is found to be:

$$ P(D \mid \mathcal{H}_c) = \left( \frac{1}{101} \right) \left( \frac{4}{101} \frac{1}{50} \right) \left( \frac{4}{101} \frac{1}{50} \right) \left( \frac{2}{101} \frac{1}{50} \right) $$
$$ = 0.0000000000025 = 2.5 \times 10^{-12}. \tag{28.3} $$

Thus comparing P(D | H_c) with P(D | H_a) = 0.00010, even if our prior
probabilities for H_a and H_c are equal, the odds, P(D | H_a) : P(D | H_c), in favour
of H_a over H_c, given the sequence D = (-1, 3, 7, 11), are about forty million
to one.
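
The arithmetic of this example is easy to check by brute-force enumeration.
The sketch below uses exact rational arithmetic; the function names rule and
count_ways are mine, and the priors are the ones stated above (101 integer
numerators from -50 to 50, 50 denominators from 1 to 50, all equally
probable). It relies on the fact that the three observed steps give three
linear equations for (c, d, e), so only one value of each fraction fits.

```python
from fractions import Fraction

data = [-1, 3, 7, 11]

# H_a: uniform priors over 101 integers for n and for the first number.
p_Ha = Fraction(1, 101) * Fraction(1, 101)

def rule(x, c, d, e):
    """The cubic rule x -> c x^3 + d x^2 + e."""
    return c * x**3 + d * x**2 + e

def count_ways(target):
    """Number of (numerator, denominator) pairs in the prior that equal target."""
    return sum(1 for num in range(-50, 51) for den in range(1, 51)
               if Fraction(num, den) == target)

c, d, e = Fraction(-1, 11), Fraction(9, 11), Fraction(23, 11)
# The cubic rule reproduces every step of the observed sequence.
assert all(rule(x, c, d, e) == y for x, y in zip(data, data[1:]))

# H_c: each (numerator, denominator) pair has prior probability 1/(101 * 50);
# the first number has the same prior as under H_a.
p_pair = Fraction(1, 101 * 50)
p_Hc = (Fraction(1, 101) * (count_ways(c) * p_pair)
        * (count_ways(d) * p_pair) * (count_ways(e) * p_pair))

odds = p_Ha / p_Hc
print(float(p_Ha), float(p_Hc), float(odds))
```

Changing the assumed ranges changes the numbers, but H_c still pays for its
three extra parameters.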

This answer depends on several subjective assumptions; in particular, the
probability assigned to the free parameters n, c, d, e of the theories. Bayesians
make no apologies for this: there is no such thing as inference or prediction
without assumptions. However, the quantitative details of the prior
probabilities have no effect on the qualitative Occam’s razor effect; the complex
theory H_c always suffers an ‘Occam factor’ because it has more parameters,
and so can predict a greater variety of data sets (figure 28.3). This was only
a small example, and there were only four data points; as we move to larger
and more sophisticated problems the magnitude of the Occam factors
typically increases, and the degree to which our inferences are influenced by the
quantitative details of our subjective assumptions becomes smaller.


Bayesian methods and data analysis 


Let us now relate the discussion above to real problems in data analysis. 

There are countless problems in science, statistics and technology which 
require that, given a limited data set, preferences be assigned to alternative 
models of differing complexities. For example, two alternative hypotheses 
accounting for planetary motion are Mr. Inquisition’s geocentric model based 
on ‘epicycles’, and Mr. Copernicus’s simpler model of the solar system with 
the sun at the centre. The epicyclic model fits data on planetary motion at 
least as well as the Copernican model, but does so using more parameters. 
Coincidentally for Mr. Inquisition, two of the extra epicyclic parameters for 
every planet are found to be identical to the period and radius of the sun’s 
‘cycle around the earth’. Intuitively we find Mr. Copernicus’s theory more 
probable. 


The mechanism of the Bayesian razor: the evidence and the Occam factor 


Two levels of inference can often be distinguished in the process of data
modelling. At the first level of inference, we assume that a particular model is
true, and we fit that model to the data, i.e., we infer what values its free
parameters should plausibly take, given the data. The results of this inference are
often summarized by the most probable parameter values, and error bars on
those parameters. This analysis is repeated for each model. The second level
of inference is the task of model comparison. Here we wish to compare the
models in the light of the data, and assign some sort of preference or ranking
to the alternatives.


Note that both levels of inference are distinct from decision theory. The goal
of inference is, given a defined hypothesis space and a particular data set, to
assign probabilities to hypotheses. Decision theory typically chooses between
alternative actions on the basis of these probabilities so as to minimize the
expectation of a ‘loss function’. This chapter concerns inference alone and no
loss functions are involved. When we discuss model comparison, this should
not be construed as implying model choice. Ideal Bayesian predictions do not
involve choice between models; rather, predictions are made by summing over
all the alternative models, weighted by their probabilities.


Figure 28.4. Where Bayesian inference fits into the data modelling process.
This figure illustrates an abstraction of the part of the scientific process in
which data are collected and modelled. In particular, this figure applies to
pattern classification, learning, interpolation, etc. The two double-framed
boxes denote the two steps which involve inference. It is only in those two
steps that Bayes’ theorem can be used. Bayes does not tell you how to invent
models, for example.
The first box, ‘fitting each model to the data’, is the task of inferring what
the model parameters might be given the model and the data. Bayesian methods
may be used to find the most probable parameter values, and error bars on those
parameters. The result of applying Bayesian methods to this problem is often
little different from the answers given by orthodox statistics.
The second inference task, model comparison in the light of the data, is where
Bayesian methods are in a class of their own. This second inference problem
requires a quantitative Occam’s razor to penalize over-complex models.
Bayesian methods can assign objective preferences to the alternative models in
a way that automatically embodies Occam’s razor.


Bayesian methods are able consistently and quantitatively to solve both 
the inference tasks. There is a popular myth that states that Bayesian meth- 
ods differ from orthodox statistical methods only by the inclusion of subjective 
priors, which are difficult to assign, and which usually don’t make much dif- 
ference to the conclusions. It is true that, at the first level of inference, a 
Bayesian’s results will often differ little from the outcome of an orthodox at- 
tack. What is not widely appreciated is how a Bayesian performs the second 
level of inference; this chapter will therefore focus on Bayesian model compar- 
ison. 

Model comparison is a difficult task because it is not possible simply to 
choose the model that fits the data best: more complex models can always 
fit the data better, so the maximum likelihood model choice would lead us 
inevitably to implausible, over-parameterized models, which generalize poorly. 
Occam’s razor is needed. 

Let us write down Bayes’ theorem for the two levels of inference described
above, so as to see explicitly how Bayesian model comparison works. Each
model H_i is assumed to have a vector of parameters w. A model is defined
by a collection of probability distributions: a ‘prior’ distribution P(w | H_i),
which states what values the model’s parameters might be expected to take;
and a set of conditional distributions, one for each value of w, defining the
predictions P(D | w, H_i) that the model makes about the data D.


1. Model fitting. At the first level of inference, we assume that one model,
the ith, say, is true, and we infer what the model’s parameters w might
be, given the data D. Using Bayes’ theorem, the posterior probability
of the parameters w is:

$$ P(\mathbf{w} \mid D, \mathcal{H}_i) = \frac{P(D \mid \mathbf{w}, \mathcal{H}_i) \, P(\mathbf{w} \mid \mathcal{H}_i)}{P(D \mid \mathcal{H}_i)}, \tag{28.4} $$

that is,

$$ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}}. $$

The normalizing constant P(D | H_i) is commonly ignored since it is
irrelevant to the first level of inference, i.e., the inference of w; but it becomes
important in the second level of inference, and we name it the evidence
for H_i. It is common practice to use gradient-based methods to find the
maximum of the posterior, which defines the most probable value for the
parameters, w_MP; it is then usual to summarize the posterior distribution
by the value of w_MP, and error bars or confidence intervals on these
best-fit parameters. Error bars can be obtained from the curvature of the
posterior; evaluating the Hessian at w_MP, A = -∇∇ ln P(w | D, H_i)|_{w_MP},
and Taylor-expanding the log posterior probability with Δw = w - w_MP:

$$ P(\mathbf{w} \mid D, \mathcal{H}_i) \simeq P(\mathbf{w}_{\rm MP} \mid D, \mathcal{H}_i) \exp\left( -\tfrac{1}{2} \Delta\mathbf{w}^{\mathsf T} \mathbf{A} \Delta\mathbf{w} \right), \tag{28.5} $$

we see that the posterior can be locally approximated as a Gaussian
with covariance matrix (equivalent to error bars) A^{-1}. [Whether this
approximation is good or not will depend on the problem we are
solving. Indeed, the maximum and mean of the posterior distribution have
no fundamental status in Bayesian inference: they both change under
nonlinear reparameterizations. Maximization of a posterior probability
is useful only if an approximation like equation (28.5) gives a good
summary of the distribution.]


2. Model comparison. At the second level of inference, we wish to infer
which model is most plausible given the data. The posterior probability
of each model is:

$$ P(\mathcal{H}_i \mid D) \propto P(D \mid \mathcal{H}_i) \, P(\mathcal{H}_i). \tag{28.6} $$

Notice that the data-dependent term P(D | H_i) is the evidence for H_i,
which appeared as the normalizing constant in (28.4). The second term,
P(H_i), is the subjective prior over our hypothesis space, which expresses
how plausible we thought the alternative models were before the data
arrived. Assuming that we choose to assign equal priors P(H_i) to the
alternative models, models H_i are ranked by evaluating the evidence. The
normalizing constant P(D) = Σ_i P(D | H_i) P(H_i) has been omitted from
equation (28.6) because in the data-modelling process we may develop
new models after the data have arrived, when an inadequacy of the first
models is detected, for example. Inference is open ended: we continually
seek more probable models to account for the data we gather.


To repeat the key idea: to rank alternative models H_i, a Bayesian
evaluates the evidence P(D | H_i). This concept is very general: the
evidence can be evaluated for parametric and ‘non-parametric’ models
alike; whatever our data-modelling task, a regression problem, a
classification problem, or a density estimation problem, the evidence is a
transportable quantity for comparing alternative models. In all these
cases the evidence naturally embodies Occam’s razor.


Figure 28.5. The Occam factor. This figure shows the quantities that determine
the Occam factor for a hypothesis H_i having a single parameter w. The prior
distribution (solid line) for the parameter has width σ_w. The posterior
distribution (dashed line) has a single peak at w_MP with characteristic width
σ_{w|D}. The Occam factor is σ_{w|D} P(w_MP | H_i) = σ_{w|D} / σ_w.


Evaluating the evidence 


Let us now study the evidence more closely to gain insight into how the
Bayesian Occam’s razor works. The evidence is the normalizing constant for
equation (28.4):

$$ P(D \mid \mathcal{H}_i) = \int P(D \mid \mathbf{w}, \mathcal{H}_i) \, P(\mathbf{w} \mid \mathcal{H}_i) \, d\mathbf{w}. \tag{28.7} $$

For many problems the posterior P(w | D, H_i) ∝ P(D | w, H_i) P(w | H_i) has
a strong peak at the most probable parameters w_MP (figure 28.5). Then,
taking for simplicity the one-dimensional case, the evidence can be
approximated, using Laplace’s method, by the height of the peak of the integrand
P(D | w, H_i) P(w | H_i) times its width, σ_{w|D}:

$$ \underbrace{P(D \mid \mathcal{H}_i)}_{\text{Evidence}} \simeq \underbrace{P(D \mid w_{\rm MP}, \mathcal{H}_i)}_{\text{Best fit likelihood}} \times \underbrace{P(w_{\rm MP} \mid \mathcal{H}_i) \, \sigma_{w|D}}_{\text{Occam factor}}. \tag{28.8} $$

Thus the evidence is found by taking the best-fit likelihood that the model
can achieve and multiplying it by an ‘Occam factor’, which is a term with
magnitude less than one that penalizes H_i for having the parameter w.


Interpretation of the Occam factor 


The quantity σ_{w|D} is the posterior uncertainty in w. Suppose for simplicity
that the prior P(w | H_i) is uniform on some large interval σ_w, representing the
range of values of w that were possible a priori, according to H_i (figure 28.5).
Then P(w_MP | H_i) = 1/σ_w, and

$$ \text{Occam factor} = \frac{\sigma_{w|D}}{\sigma_w}, \tag{28.9} $$

i.e., the Occam factor is equal to the ratio of the posterior accessible volume
of H_i’s parameter space to the prior accessible volume, or the factor by which
H_i’s hypothesis space collapses when the data arrive. The model H_i can be
viewed as consisting of a certain number of exclusive submodels, of which only
one survives when the data arrive. The Occam factor is the inverse of that
number. The logarithm of the Occam factor is a measure of the amount of
information we gain about the model’s parameters when the data arrive.

A complex model having many parameters, each of which is free to vary
over a large range σ_w, will typically be penalized by a stronger Occam factor
than a simpler model. The Occam factor also penalizes models that have to
be finely tuned to fit the data, favouring models for which the required
precision of the parameters σ_{w|D} is coarse. The magnitude of the Occam factor
is thus a measure of complexity of the model; it relates to the complexity of
the predictions that the model makes in data space. This depends not only
on the number of parameters in the model, but also on the prior probability
that the model assigns to them. Which model achieves the greatest evidence
is determined by a trade-off between minimizing this natural complexity
measure and minimizing the data misfit. In contrast to alternative measures of
model complexity, the Occam factor for a model is straightforward to
evaluate: it simply depends on the error bars on the parameters, which we already
evaluated when fitting the model to the data.

Figure 28.6 displays an entire hypothesis space so as to illustrate the
various probabilities in the analysis. There are three models, H_1, H_2, H_3, which
have equal prior probabilities. Each model has one parameter w (each shown
on a horizontal axis), but assigns a different prior range σ_w to that
parameter. H_3 is the most ‘flexible’ or ‘complex’ model, assigning the broadest prior
range. A one-dimensional data space is shown by the vertical axis. Each
model assigns a joint probability distribution P(D, w | H_i) to the data and
the parameters, illustrated by a cloud of dots. These dots represent random
samples from the full probability distribution. The total number of dots in
each of the three model subspaces is the same, because we assigned equal prior
probabilities to the models.

When a particular data set D is received (horizontal line), we infer the
posterior distribution of w for a model (H_3, say) by reading out the density along
that horizontal line, and normalizing. The posterior probability P(w | D, H_3)
is shown by the dotted curve at the bottom. Also shown is the prior
distribution P(w | H_3) (cf. figure 28.5). [In the case of model H_1 which is very poorly
matched to the data, the shape of the posterior distribution will depend on
the details of the tails of the prior P(w | H_1) and the likelihood P(D | w, H_1);
the curve shown is for the case where the prior falls off more strongly.]


Figure 28.6. A hypothesis space consisting of three exclusive models, each
having one parameter w, and a one-dimensional data set D. The ‘data set’ is a
single measured value which differs from the parameter w by a small amount
of additive noise. Typical samples from the joint distribution P(D, w, H) are
shown by dots. (N.B., these are not data points.) The observed ‘data set’ is a
single particular value for D shown by the dashed horizontal line. The dashed
curves below show the posterior probability of w for each model given this
data set (cf. figure 28.3). The evidence for the different models is obtained by
marginalizing onto the D axis at the left-hand side (cf. figure 28.5).


We obtain figure 28.3 by marginalizing the joint distributions P(D, w | H_i)
onto the D axis at the left-hand side. For the data set D shown by the dotted
horizontal line, the evidence P(D | H_3) for the more flexible model H_3 has
a smaller value than the evidence for H_2. This is because H_3 placed less
predictive probability (fewer dots) on that line. In terms of the distributions
over w, model H_3 has smaller evidence because the Occam factor σ_{w|D}/σ_w is
smaller for H_3 than for H_2. The simplest model H_1 has the smallest evidence
of all, because the best fit that it can achieve to the data D is very poor.
Given this data set, the most probable model is H_2.
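
The ranking of the three models can be reproduced numerically. The sketch
below is only an illustration with made-up numbers: I assume that D equals w
plus Gaussian noise of standard deviation 0.1, that each model's prior on w
is uniform on [0, σ_w] with σ_w = 1, 3 and 9, and that the observed value is
D = 1.5, which lies outside the prior range of the narrowest model. The
function names are hypothetical. For a peak well inside the prior, the exact
evidence is compared with the product of best-fit likelihood and Occam
factor, taking the posterior width to be √(2π) times the noise level (the
one-dimensional Gaussian integral).

```python
import math

def gauss_tail(z):
    """Upper tail probability of a standard normal, P(Z > z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def evidence(D, width, noise):
    """Exact P(D | H) when D = w + Gaussian noise and w is uniform on [0, width]."""
    # Integral over w in [0, width] of Normal(D; w, noise^2), divided by the width.
    mass = gauss_tail((D - width) / noise) - gauss_tail(D / noise)
    return mass / width

def best_fit_times_occam(width, noise):
    """Best-fit likelihood times Occam factor, for a peak well inside the prior."""
    best_fit = 1.0 / (math.sqrt(2.0 * math.pi) * noise)   # likelihood at w_MP = D
    occam = math.sqrt(2.0 * math.pi) * noise / width      # posterior width / prior width
    return best_fit * occam

D, noise = 1.5, 0.1                          # assumed observation and noise level
widths = {"H_1": 1.0, "H_2": 3.0, "H_3": 9.0}  # assumed prior ranges sigma_w
ev = {name: evidence(D, w, noise) for name, w in widths.items()}
for name in widths:
    print(name, ev[name])
```

Under these assumptions H_2 has the greatest evidence, H_3 is penalized by
its smaller Occam factor, and H_1 fails because it cannot fit the data.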


Occam factor for several parameters 


If the posterior is well approximated by a Gaussian, then the Occam factor
is obtained from the determinant of the corresponding covariance matrix (cf.
equation (28.8) and Chapter 27):

$$ \underbrace{P(D \mid \mathcal{H}_i)}_{\text{Evidence}} \simeq \underbrace{P(D \mid \mathbf{w}_{\rm MP}, \mathcal{H}_i)}_{\text{Best fit likelihood}} \times \underbrace{P(\mathbf{w}_{\rm MP} \mid \mathcal{H}_i) \, \det{}^{-\frac{1}{2}}(\mathbf{A}/2\pi)}_{\text{Occam factor}}, \tag{28.10} $$

where A = -∇∇ ln P(w | D, H_i), the Hessian which we evaluated when we
calculated the error bars on w_MP (equation 28.5 and Chapter 27). As the
amount of data collected increases, this Gaussian approximation is expected
to become increasingly accurate.

In summary, Bayesian model comparison is a simple extension of maximum 
likelihood model selection: the evidence is obtained by multiplying the best-fit 
likelihood by the Occam factor. 

To evaluate the Occam factor we need only the Hessian A, if the Gaussian 
approximation is good. Thus the Bayesian method of model comparison by 
evaluating the evidence is no more computationally demanding than the task 
of finding for each model the best-fit parameters and their error bars. 
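
The recipe of best-fit likelihood times Occam factor can be sketched numerically. The toy model below is an assumption made only for illustration (it is not from the text): data t_n are Gaussian about a single parameter w with known noise variance, and w has a zero-mean Gaussian prior. In this special case the posterior is exactly Gaussian, so the approximation should agree with the exact evidence. The function names are hypothetical.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def laplace_log_evidence(data, prior_var, noise_var):
    """Log evidence as log(best-fit likelihood) + log(Occam factor).

    Assumed toy model: t_n ~ Normal(w, noise_var), w ~ Normal(0, prior_var).
    """
    n = len(data)
    # A = -d^2/dw^2 ln P(w | D, H); constant because the log posterior is quadratic.
    hessian = n / noise_var + 1.0 / prior_var
    # Most probable parameter value (posterior mode).
    w_mp = (sum(data) / noise_var) / hessian
    log_best_fit = sum(log_gauss(t, w_mp, noise_var) for t in data)
    # Occam factor for one parameter: P(w_MP | H) * (A / 2 pi)^(-1/2).
    log_occam = log_gauss(w_mp, 0.0, prior_var) + 0.5 * math.log(2 * math.pi / hessian)
    return log_best_fit + log_occam, log_best_fit, log_occam


if __name__ == "__main__":
    total, best_fit, occam = laplace_log_evidence([0.9, 1.1, 1.3], 4.0, 1.0)
    print("log best-fit likelihood:", best_fit)
    print("log Occam factor:       ", occam)
    print("log evidence:           ", total)
```

The log Occam factor is negative here: it is the penalty the model pays for having a prior that is broader than its posterior.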
▶ 28.2 Example


Let’s return to the example that opened this chapter. Are there one or two
boxes behind the tree in figure 28.1? Why do coincidences make us suspicious?

Let’s assume the image of the area round the trunk and box has a size
of 50 pixels, that the trunk is 10 pixels wide, and that 16 different colours of
boxes can be distinguished. The theory H1 that says there is one box near
the trunk has four free parameters: three coordinates defining the top three
edges of the box, and one parameter giving the box’s colour. (If boxes could
levitate, there would be five free parameters.)

The theory H2 that says there are two boxes near the trunk has eight free
parameters (twice four), plus a ninth, a binary variable that indicates which
of the two boxes is the closest to the viewer.

Figure 28.7. How many boxes are behind the tree?

What is the evidence for each model? We’ll do H1 first. We need a prior on
the parameters to evaluate the evidence. For convenience, let’s work in pixels.
Let’s assign a separable prior to the horizontal location of the box, its width,
its height, and its colour. The height could have any of, say, 20 distinguishable
values, so could the width, and so could the location. The colour could have
any of 16 values. We’ll put uniform priors over these variables. We’ll ignore
all the parameters associated with other objects in the image, since they don’t
come into the model comparison between H1 and H2. The evidence is

$$P(D \mid H_1) = \frac{1}{20}\,\frac{1}{20}\,\frac{1}{20}\,\frac{1}{16} \qquad (28.11)$$
since only one setting of the parameters fits the data, and it predicts the data 
perfectly. 

As for model H2, six of its nine parameters are well-determined, and three
of them are partly-constrained by the data. If the left-hand box is furthest 
away, for example, then its width is at least 8 pixels and at most 30; if it’s 
the closer of the two boxes, then its width is between 8 and 18 pixels. (I’m 
assuming here that the visible portion of the left-hand box is about 8 pixels 
wide.) To get the evidence we need to sum up the prior probabilities of all 
viable hypotheses. To do an exact calculation, we need to be more specific 
about the data and the priors, but let’s just get the ballpark answer, assuming 
that the two unconstrained real variables have half their values available, and 
that the binary variable is completely undetermined. (As an exercise, you can 
make an explicit model and work out the exact answer.) 

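$$P(D \mid H_2) \simeq \frac{1}{20}\,\frac{1}{20}\,\frac{10}{20}\,\frac{1}{16}\;\frac{1}{20}\,\frac{1}{20}\,\frac{10}{20}\,\frac{1}{16}\;\frac{2}{2}. \qquad (28.12)$$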





Thus the posterior probability ratio is (assuming equal prior probability): 


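$$\frac{P(D \mid H_1)\,P(H_1)}{P(D \mid H_2)\,P(H_2)} = \frac{1}{\frac{1}{20}\,\frac{10}{20}\,\frac{10}{20}\,\frac{1}{16}} \qquad (28.13)$$

$$= 20 \times 2 \times 2 \times 16 \simeq 1000/1. \qquad (28.14)$$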


So the data are roughly 1000 to 1 in favour of the simpler hypothesis. The 
four factors in (28.13) can be interpreted in terms of Occam factors. The more 
complex model has four extra parameters for sizes and colours — three for sizes, 
and one for colour. It has to pay two big Occam factors (1/20 and 1/16) for the 
highly suspicious coincidences that the two box heights match exactly and the 
two colours match exactly; and it also pays two lesser Occam factors for the 
two lesser coincidences that both boxes happened to have one of their edges 
conveniently hidden behind a tree or behind each other. 
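
The arithmetic behind this ratio can be checked with exact fractions. The sketch below simply multiplies the prior factors quoted in the text (1/20 for each well-determined size or position, 10/20 for each half-constrained width, 1/16 for each colour, 2/2 for the undetermined binary variable); the variable names are illustrative.

```python
from fractions import Fraction

# Evidence for H1: one box, all four parameters pinned down by the data.
evidence_h1 = Fraction(1, 20) * Fraction(1, 20) * Fraction(1, 20) * Fraction(1, 16)

# Evidence for H2: for each of the two boxes, two well-determined parameters
# (1/20 each), one half-constrained width (10/20) and one colour (1/16);
# the binary "which box is closer" variable is undetermined (2/2).
per_box = Fraction(1, 20) * Fraction(1, 20) * Fraction(10, 20) * Fraction(1, 16)
evidence_h2 = per_box * per_box * Fraction(2, 2)

# With equal prior probabilities the posterior ratio equals the evidence ratio.
ratio = evidence_h1 / evidence_h2

if __name__ == "__main__":
    print("P(D|H1) =", evidence_h1)
    print("P(D|H2) =", evidence_h2)
    print("ratio   =", ratio)
```

The exact ratio is 20 x 2 x 2 x 16 = 1280, which the text rounds to 1000 to 1.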
H1:  L(H1) | L(w*(1) | H1) | L(D | w*(1), H1)
H2:  L(H2) | L(w*(2) | H2) | L(D | w*(2), H2)
H3:  L(H3) | L(w*(3) | H3) | L(D | w*(3), H3)

Figure 28.8. A popular view of model comparison by minimum description
length. Each model H_i communicates the data D by sending the identity of
the model, sending the best-fit parameters of the model w*, then sending the
data relative to those parameters. As we proceed to more complex models the
length of the parameter message increases. On the other hand, the length of
the data message decreases, because a complex model is able to fit the data
better, making the residuals smaller. In this example the intermediate model
H2 achieves the optimum trade-off between these two trends.


▶ 28.3 Minimum description length (MDL)


A complementary view of Bayesian model comparison is obtained by replacing
probabilities of events by the lengths in bits of messages that communicate
the events without loss to a receiver. Message lengths L(x) correspond to a
probabilistic model over events x via the relations:

$$P(x) = 2^{-L(x)}, \qquad L(x) = -\log_2 P(x). \qquad (28.15)$$

The MDL principle (Wallace and Boulton, 1968) states that one should
prefer models that can communicate the data in the smallest number of bits.


Consider a two-part message that states which model, H, is to be used, and
then communicates the data D within that model, to some pre-arranged
precision $\delta D$. This produces a message of length L(D, H) = L(H) + L(D | H).
The lengths L(H) for different H define an implicit prior P(H) over the
alternative models. Similarly L(D | H) corresponds to a density P(D | H). Thus, a
procedure for assigning message lengths can be mapped onto posterior
probabilities:

$$L(D, H) = -\log P(H) - \log\left(P(D \mid H)\,\delta D\right) \qquad (28.16)$$
$$= -\log P(H \mid D) + \text{const}. \qquad (28.17)$$


In principle, then, MDL can always be interpreted as Bayesian model compar- 
ison and vice versa. However, this simple discussion has not addressed how 
one would actually evaluate the key data-dependent term L(D|H), which 
corresponds to the evidence for H. Often, this message is imagined as being 
subdivided into a parameter block and a data block (figure 28.8). Models with 
a small number of parameters have only a short parameter block but do not 
fit the data well, and so the data message (a list of large residuals) is long. As 
the number of parameters increases, the parameter block lengthens, and the 
data message becomes shorter. There is an optimum model complexity (H2
in the figure) for which the sum is minimized.

This picture glosses over some subtle issues. We have not specified the 
precision to which the parameters w should be sent. This precision has an 
important effect (unlike the precision $\delta D$ to which real-valued data D are
sent, which, assuming $\delta D$ is small relative to the noise level, just introduces
an additive constant). As we decrease the precision to which w is sent, the
parameter message shortens, but the data message typically lengthens because
the truncated parameters do not match the data so well. There is a non-trivial
optimal precision. In simple Gaussian cases it is possible to solve for this
optimal precision (Wallace and Freeman, 1987), and it is closely related to the
posterior error bars on the parameters, $A^{-1}$, where $A = -\nabla\nabla \ln P(w \mid D, H)$.
It turns out that the optimal parameter message length is virtually identical to 
the log of the Occam factor in equation (28.10). (The random element involved 
in parameter truncation means that the encoding is slightly sub-optimal.) 

With care, therefore, one can replicate Bayesian results in MDL terms. 
Although some of the earliest work on complex model comparison involved 
the MDL framework (Patrick and Wallace, 1982), MDL has no apparent ad- 
vantages over the direct probabilistic approach. 
MDL does have its uses as a pedagogical tool. The description length
concept is useful for motivating prior probability distributions. Also, different
ways of breaking down the task of communicating data using a model can give
helpful insights into the modelling process, as will now be illustrated.


On-line learning and cross-validation. 


In cases where the data consist of a sequence of points $D = t^{(1)}, t^{(2)}, \ldots, t^{(N)}$,
the log evidence can be decomposed as a sum of ‘on-line’ predictive
performances:

$$\log P(D \mid H) = \log P(t^{(1)} \mid H) + \log P(t^{(2)} \mid t^{(1)}, H) + \log P(t^{(3)} \mid t^{(1)}, t^{(2)}, H) + \cdots + \log P(t^{(N)} \mid t^{(1)} \ldots t^{(N-1)}, H). \qquad (28.18)$$

This decomposition can be used to explain the difference between the
evidence and ‘leave-one-out cross-validation’ as measures of predictive
ability. Cross-validation examines the average value of just the last term,
$\log P(t^{(N)} \mid t^{(1)} \ldots t^{(N-1)}, H)$, under random re-orderings of the data. The
evidence, on the other hand, sums up how well the model predicted all the data,
starting from scratch.
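
The decomposition can be verified on a small case. The model below is an assumption chosen for illustration only: binary outcomes with an unknown bias and a uniform prior on that bias, for which each on-line predictive probability follows Laplace's rule of succession and the batch evidence has a closed form. The function names are hypothetical.

```python
import math


def online_log_evidence(bits):
    """Sum of on-line predictive log probabilities, one term per data point.

    Assumed model: Bernoulli outcomes, uniform prior on the bias, so
    P(next = 1 | past) = (n1 + 1) / (n + 2).
    """
    n1 = n0 = 0
    total = 0.0
    for t in bits:
        p_one = (n1 + 1) / (n1 + n0 + 2)
        total += math.log(p_one if t == 1 else 1.0 - p_one)
        if t == 1:
            n1 += 1
        else:
            n0 += 1
    return total


def batch_log_evidence(bits):
    """Log evidence computed in one go: n1! n0! / (n1 + n0 + 1)!."""
    n1 = sum(bits)
    n0 = len(bits) - n1
    return math.lgamma(n1 + 1) + math.lgamma(n0 + 1) - math.lgamma(n1 + n0 + 2)


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 1, 1, 1]
    print(online_log_evidence(data), batch_log_evidence(data))
```

The individual terms depend on the order in which the data are presented, but their sum does not; cross-validation, by contrast, looks only at the final term.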


The ‘bits-back’ encoding method. 


Another MDL thought experiment (Hinton and van Camp, 1993) involves
incorporating random bits into our message. The data are communicated using a
parameter block and a data block. The parameter vector sent is a random
sample from the posterior, P(w | D, H) = P(D | w, H) P(w | H)/P(D | H). This
sample w is sent to an arbitrary small granularity $\delta w$ using a message length
$L(w \mid H) = -\log[P(w \mid H)\,\delta w]$. The data are encoded relative to w with a
message of length $L(D \mid w, H) = -\log[P(D \mid w, H)\,\delta D]$. Once the data
message has been received, the random bits used to generate the sample w from
the posterior can be deduced by the receiver. The number of bits so
recovered is $-\log[P(w \mid D, H)\,\delta w]$. These recovered bits need not count towards the
message length, since we might use some other optimally-encoded message as
a random bit string, thereby communicating that message at the same time.
The net description cost is therefore:

$$L(w \mid H) + L(D \mid w, H) - \text{‘Bits back’} = -\log \frac{P(w \mid H)\,P(D \mid w, H)\,\delta D}{P(w \mid D, H)}$$
$$= -\log P(D \mid H) - \log \delta D. \qquad (28.19)$$


Thus this thought experiment has yielded the optimal description length. Bits- 
back encoding has been turned into a practical compression method for data 
modelled with latent variable models by Frey (1998). 
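
The bits-back accounting can be checked on a discrete toy problem, where the granularities of w and D play no role. The sketch below assumes a parameter that takes one of a few discrete values, with a given prior and given likelihoods for the observed data; these numbers and the function name are illustrative assumptions, not taken from the text.

```python
import math


def bits_back_costs(prior, likelihoods):
    """Net message length in bits for each possible sampled parameter value.

    prior[k]       = P(w_k | H)
    likelihoods[k] = P(D | w_k, H) for the observed data D
    Returns (list of net costs, -log2 P(D | H)).
    """
    evidence = sum(p * l for p, l in zip(prior, likelihoods))
    costs = []
    for p, l in zip(prior, likelihoods):
        posterior = p * l / evidence
        parameter_bits = -math.log2(p)          # L(w | H)
        data_bits = -math.log2(l)               # L(D | w, H)
        recovered_bits = -math.log2(posterior)  # the 'bits back'
        costs.append(parameter_bits + data_bits - recovered_bits)
    return costs, -math.log2(evidence)


if __name__ == "__main__":
    biases = [0.2, 0.5, 0.8]
    prior = [1 / 3, 1 / 3, 1 / 3]
    # Observed data: three ones and one zero from a coin with bias w.
    likelihoods = [w ** 3 * (1 - w) for w in biases]
    costs, optimal = bits_back_costs(prior, likelihoods)
    print(costs, optimal)
```

Whichever parameter value happens to be sampled, the net cost comes out equal to the ideal description length of the data under the model.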


Further reading 


Bayesian methods are introduced and contrasted with sampling-theory statis- 
tics in (Jaynes, 1983; Gull, 1988; Loredo, 1990). The Bayesian Occam’s razor 
is demonstrated on model problems in (Gull, 1988; MacKay, 1992a). Useful 
textbooks are (Box and Tiao, 1973; Berger, 1985). 

One debate worth understanding is the question of whether it’s permis- 
sible to use improper priors in Bayesian inference (Dawid et al., 1996). If 
we want to do model comparison (as discussed in this chapter), it is essen- 
tial to use proper priors — otherwise the evidences and the Occam factors are 
meaningless. Only when one has no intention to do model comparison may
it be safe to use improper priors, and even in such cases there are pitfalls, as 
Dawid et al. explain. I would agree with their advice to always use proper 
priors, tempered by an encouragement to be smart when making calculations, 
recognizing opportunities for approximation. 


▶ 28.4 Exercises


Exercise 28.1.[3] Random variables x come independently from a probability
distribution P(x). According to model H0, P(x) is a uniform distribution

$$P(x \mid H_0) = \frac{1}{2}, \qquad x \in (-1, 1). \qquad (28.20)$$

According to model H1, P(x) is a nonuniform distribution with an
unknown parameter $m \in (-1, 1)$:

$$P(x \mid m, H_1) = \frac{1}{2}(1 + mx), \qquad x \in (-1, 1). \qquad (28.21)$$

[Margin figures: P(x | H0), and P(x | m = -0.4, H1), each plotted for x from -1 to 1.]

Given the data D = {0.3, 0.5, 0.7, 0.8, 0.9}, what is the evidence for H0
and H1?
Exercise 28.2.[3] Datapoints (x, t) are believed to come from a straight line.
The experimenter chooses x, and t is Gaussian-distributed about

$$y = w_0 + w_1 x \qquad (28.22)$$

with variance $\sigma_\nu^2$. According to model H1, the straight line is horizontal,
so $w_1 = 0$. According to model H2, $w_1$ is a parameter with prior
distribution Normal(0, 1). Both models assign a prior distribution Normal(0, 1)
to $w_0$. Given the data set D = {(-8, 8), (-2, 10), (6, 11)}, and assuming
the noise level is $\sigma_\nu = 1$, what is the evidence for each model?


Exercise 28.3.[3] A six-sided die is rolled 30 times and the numbers of times
each face came up were F = {3, 3, 2, 2, 9, 11}. What is the probability
that the die is a perfectly fair die (‘H0’), assuming the alternative
hypothesis H1 says that the die has a biased distribution p, and the prior
density for p is uniform over the simplex $p_i \geq 0$, $\sum_i p_i = 1$?


Solve this problem two ways: exactly, using the helpful Dirichlet formu- 
lae (23.30, 23.31), and approximately, using Laplace’s method. Notice 
that your choice of basis for the Laplace approximation is important. 
See MacKay (1998a) for discussion of this exercise. 


Exercise 28.4.[3] The influence of race on the imposition of the death penalty
for murder in America has been much studied. The following three-way
table classifies 326 cases in which the defendant was convicted of
murder. The three variables are the defendant’s race, the victim’s race, and
whether the defendant was sentenced to death. (Data from M. Radelet,
‘Racial characteristics and imposition of the death penalty,’ American
Sociological Review, 46 (1981), pp. 918-927.)

                    White defendant        Black defendant
                    Death penalty          Death penalty
                    Yes      No            Yes      No
    White victim     19     132             11      52
    Black victim      0       9              6      97
It seems that the death penalty was applied much more often when the
victim was white than when the victim was black. When the victim was
white 14% of defendants got the death penalty, but when the victim was
black 6% of defendants got the death penalty. [Incidentally, these data
provide an example of a phenomenon known as Simpson’s paradox: a
higher fraction of white defendants are sentenced to death overall, but
in cases involving black victims a higher fraction of black defendants are
sentenced to death and in cases involving white victims a higher fraction
of black defendants are sentenced to death.]

Quantify the evidence for the four alternative hypotheses shown in
figure 28.9. I should mention that I don’t believe any of these models is
adequate: several additional variables are important in murder cases,
such as whether the victim and murderer knew each other, whether the
murder was premeditated, and whether the defendant had a prior
criminal record; none of these variables is included in the table. So this is
an academic exercise in model comparison rather than a serious study
of racial bias in the state of Florida.

Figure 28.9. Four hypotheses (H11, H10, H01, H00) concerning the
dependence of the imposition of the death penalty d on the race of the
victim v and the race of the convicted murderer m. H01, for example,
asserts that the probability of receiving the death penalty does depend
on the murderer’s race, but not on the victim’s.

The hypotheses are shown as graphical models, with arrows showing
dependencies between the variables v (victim race), m (murderer race),
and d (whether death penalty given). Model H00 has only one free
parameter, the probability of receiving the death penalty; model H11 has
four such parameters, one for each state of the variables v and m. Assign
uniform priors to these variables. How sensitive are the conclusions to
the choice of prior?


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


About Chapter 29 


The last couple of chapters have assumed that a Gaussian approximation to 
the probability distribution we are interested in is adequate. What if it is not? 
We have already seen an example — clustering — where the likelihood function 
is multimodal, and has nasty unboundedly-high spikes in certain locations in 
the parameter space; so maximizing the posterior probability and fitting a 
Gaussian is not always going to work. This difficulty with Laplace’s method is 
one motivation for being interested in Monte Carlo methods. In fact, Monte 
Carlo methods provide a general-purpose set of tools with applications in 
Bayesian data modelling and many other fields. 

This chapter describes a sequence of methods: importance sampling, re- 
jection sampling, the Metropolis method, Gibbs sampling and slice sampling. 
For each method, we discuss whether the method is expected to be useful for 
high-dimensional problems such as arise in inference with graphical models. 
[A graphical model is a probabilistic model in which dependencies and inde- 
pendencies of variables are represented by edges in a graph whose nodes are 
the variables.] Along the way, the terminology of Markov chain Monte Carlo 
methods is presented. The subsequent chapter discusses advanced methods 
for reducing random walk behaviour. 

For details of Monte Carlo methods, theorems and proofs and a full list 
of references, the reader is directed to Neal (1993b), Gilks et al. (1996), and 
Tanner (1996). 

In this chapter I will use the word ‘sample’ in the following sense: a sample 
from a distribution P(x) is a single realization x whose probability distribution 
is P(x). This contrasts with the alternative usage in statistics, where ‘sample’ 
refers to a collection of realizations {x}. 

When we discuss transition probability matrices, I will use a
right-multiplication convention: I like my matrices to act to the right, preferring

$$u = Mv \qquad (29.1)$$

to

$$u^{\mathsf T} = v^{\mathsf T} M^{\mathsf T}. \qquad (29.2)$$

A transition probability matrix $T_{ij}$ or $T_{i|j}$ specifies the probability, given the
current state is j, of making the transition from j to i. The columns of T are
probability vectors. If we write down a transition probability density, we use
the same convention for the order of its arguments: $T(x'; x)$ is a transition
probability density from x to x'. This unfortunately means that you have
to get used to reading from right to left: the sequence xyz has probability

$$T(z; y)\,T(y; x)\,\pi(x).$$
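
A small sketch may help fix this convention. The two-state matrix below is an arbitrary illustrative choice; the point is only that entry T[i][j] is the probability of moving to state i from state j, so columns sum to one, a distribution is propagated by multiplying on the left by T, and a path probability is read from right to left.

```python
# T[i][j] = probability of moving to state i given the current state is j.
T = [
    [0.9, 0.5],
    [0.1, 0.5],
]


def step(T, p):
    """Propagate a probability vector one step: p_next = T p."""
    return [sum(T[i][j] * p[j] for j in range(len(p))) for i in range(len(T))]


def sequence_probability(T, pi, states):
    """Probability of a path of states, given initial distribution pi.

    For the path (x, y, z) this is T[z][y] * T[y][x] * pi[x].
    """
    prob = pi[states[0]]
    for previous, following in zip(states, states[1:]):
        prob *= T[following][previous]
    return prob


if __name__ == "__main__":
    print(step(T, [1.0, 0.0]))
    print(sequence_probability(T, [0.5, 0.5], [0, 1, 1]))
```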


29


Monte Carlo Methods 


▶ 29.1 The problems to be solved


Monte Carlo methods are computational techniques that make use of random 
numbers. The aims of Monte Carlo methods are to solve one or both of the 
following problems. 


Problem 1: to generate samples $\{x^{(r)}\}_{r=1}^{R}$ from a given probability
distribution P(x).


Problem 2: to estimate expectations of functions under this distribution, for
example

$$\Phi = \langle \phi(x) \rangle \equiv \int d^N x\; P(x)\,\phi(x). \qquad (29.3)$$


The probability distribution P(x), which we call the target density, might
be a distribution from statistical physics or a conditional distribution arising
in data modelling, for example the posterior probability of a model’s
parameters given some observed data. We will generally assume that x is an
N-dimensional vector with real components $x_n$, but we will sometimes
consider discrete spaces also.

Simple examples of functions $\phi(x)$ whose expectations we might be
interested in include the first and second moments of quantities that we wish to
predict, from which we can compute means and variances; for example if some
quantity t depends on x, we can find the mean and variance of t under P(x)
by finding the expectations of the functions $\phi_1(x) = t(x)$ and $\phi_2(x) = (t(x))^2$,

$$\Phi_1 \equiv E[\phi_1(x)] \quad \text{and} \quad \Phi_2 \equiv E[\phi_2(x)], \qquad (29.4)$$

then using

$$\bar{t} = \Phi_1 \quad \text{and} \quad \operatorname{var}(t) = \Phi_2 - \Phi_1^2. \qquad (29.5)$$


It is assumed that P(x) is sufficiently complex that we cannot evaluate these 
expectations by exact methods; so we are interested in Monte Carlo methods. 

We will concentrate on the first problem (sampling), because if we have
solved it, then we can solve the second problem by using the random samples
$\{x^{(r)}\}_{r=1}^{R}$ to give the estimator

$$\hat{\Phi} \equiv \frac{1}{R} \sum_r \phi(x^{(r)}). \qquad (29.6)$$

If the vectors $\{x^{(r)}\}_{r=1}^{R}$ are generated from P(x) then the expectation of $\hat{\Phi}$ is
$\Phi$. Also, as the number of samples R increases, the variance of $\hat{\Phi}$ will decrease
as $\sigma^2/R$, where $\sigma^2$ is the variance of $\phi$,

$$\sigma^2 = \int d^N x\; P(x)\,(\phi(x) - \Phi)^2. \qquad (29.7)$$
Figure 29.1. (a) The function $P^*(x) = \exp[0.4(x - 0.4)^2 - 0.08x^4]$. How
to draw samples from this density? (b) The function P*(x) evaluated at a
discrete set of uniformly spaced points $\{x_i\}$. How to draw samples from
this discrete distribution? [Both panels: horizontal axis from -4 to 4,
vertical axis from 0 to 3.]

This is one of the important properties of Monte Carlo methods. 





The accuracy of the Monte Carlo estimate (29.6) depends only on
the variance of $\phi$, not on the dimensionality of the space sampled.
To be precise, the variance of $\hat{\Phi}$ goes as $\sigma^2/R$. So regardless of the
dimensionality of x, it may be that as few as a dozen independent
samples $\{x^{(r)}\}$ suffice to estimate $\Phi$ satisfactorily.





We will find later, however, that high dimensionality can cause other diffi- 
culties for Monte Carlo methods. Obtaining independent samples from a given 
distribution P(x) is often not easy. 
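
The claim that the estimator's variance depends on the variance of the function and not on the dimensionality can be checked empirically. The sketch below makes assumptions purely for illustration: the target is a standard N-dimensional Gaussian (from which independent samples are easy to draw), and the function averaged is the sum of the components divided by the square root of N, which has variance 1 for every N. The function names are hypothetical.

```python
import math
import random
import statistics


def phi(x):
    """Test function with variance 1 under a standard Gaussian, for any dimension."""
    return sum(x) / math.sqrt(len(x))


def mc_estimate(n_dims, n_samples, rng):
    """Simple Monte Carlo estimate of the mean of phi from independent samples."""
    total = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        total += phi(x)
    return total / n_samples


def estimator_variance(n_dims, n_samples, n_trials, seed=0):
    """Empirical variance of the estimator over repeated independent trials."""
    rng = random.Random(seed)
    estimates = [mc_estimate(n_dims, n_samples, rng) for _ in range(n_trials)]
    return statistics.pvariance(estimates)


if __name__ == "__main__":
    for n_dims in (1, 50):
        # Expected value is sigma^2 / R = 1 / 25 = 0.04 in both cases.
        print(n_dims, estimator_variance(n_dims, n_samples=25, n_trials=400))
```

Both dimensionalities should give a variance near 1/R; what the sketch does not address is how hard it is to obtain the independent samples in the first place.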


Why is sampling from P(x) hard? 


We will assume that the density from which we wish to draw samples, P(x), 
can be evaluated, at least to within a multiplicative constant; that is, we can 
evaluate a function P*(x) such that 


P(x) = P*(x)/Z. (29.8) 


If we can evaluate P*(x), why can we not easily solve problem 1? Why is it in 
general difficult to obtain samples from P(x)? There are two difficulties. The 
first is that we typically do not know the normalizing constant 


$$Z = \int d^N x\; P^*(x). \qquad (29.9)$$


The second is that, even if we did know Z, the problem of drawing samples 
from P(x) is still a challenging one, especially in high-dimensional spaces, 
because there is no obvious way to sample from P without enumerating most 
or all of the possible states. Correct samples from P will by definition tend 
to come from places in x-space where P(x) is big; how can we identify those 
places where P(x) is big, without evaluating P(x) everywhere? There are only 
a few high-dimensional densities from which it is easy to draw samples, for 
example the Gaussian distribution. 

Let us start with a simple one-dimensional example. Imagine that we wish 
to draw samples from the density P(x) = P*(x)/Z where 


$$P^*(x) = \exp\left[0.4(x - 0.4)^2 - 0.08x^4\right], \qquad x \in (-\infty, \infty). \qquad (29.10)$$


We can plot this function (figure 29.1a). But that does not mean we can draw 
samples from it. To start with, we don’t know the normalizing constant Z. 
To give ourselves a simpler problem, we could discretize the variable x and
ask for samples from the discrete probability distribution over a finite set of
uniformly spaced points $\{x_i\}$ (figure 29.1b). How could we solve this problem?
If we evaluate $p_i^* = P^*(x_i)$ at each point $x_i$, we can compute

$$Z = \sum_i p_i^* \qquad (29.11)$$

and

$$p_i = p_i^*/Z \qquad (29.12)$$

and we can then sample from the probability distribution $\{p_i\}$ using various
methods based on a source of random bits (see section 6.3). But what is the
cost of this procedure, and how does it scale with the dimensionality of the
space, N? Let us concentrate on the initial cost of evaluating Z (29.11). To
compute Z we have to visit every point in the space. In figure 29.1b there are
50 uniformly spaced points in one dimension. If our system had N dimensions,
N = 1000 say, then the corresponding number of points would be $50^{1000}$, an
unimaginable number of evaluations of P*. Even if each component $x_n$ took
only two discrete values, the number of evaluations of P* would be $2^{1000}$, a
number that is still horribly huge. If every electron in the universe (there are
about $2^{266}$ of them) were a 1000 gigahertz computer that could evaluate P*
for a trillion ($2^{40}$) states every second, and if we ran those $2^{266}$ computers for
a time equal to the age of the universe ($2^{58}$ seconds), they would still only
visit $2^{364}$ states. We’d have to wait for more than $2^{636} \simeq 10^{190}$ universe ages
to elapse before all $2^{1000}$ states had been visited.
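
In one dimension the discretized procedure is easy to carry out, and a short sketch makes the scaling argument concrete. The interval from -4 to 4 is an assumption (read off the axis of figure 29.1), and the function names are illustrative; the standard-library call `random.Random.choices` stands in for the methods of section 6.3.

```python
import math
import random


def p_star(x):
    """Unnormalized density of equation (29.10)."""
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)


def discretize(lo, hi, n_points):
    """Evaluate P* on a uniform grid and normalize, as in (29.11) and (29.12)."""
    xs = [lo + (hi - lo) * i / (n_points - 1) for i in range(n_points)]
    unnormalized = [p_star(x) for x in xs]
    z = sum(unnormalized)  # costs one evaluation of P* per grid point
    return xs, [p / z for p in unnormalized]


def sample(xs, probs, rng, n):
    """Draw n samples from the discrete distribution over the grid points."""
    return rng.choices(xs, weights=probs, k=n)


def grid_points_log10(points_per_dim, n_dims):
    """log10 of the number of grid points in n_dims dimensions."""
    return n_dims * math.log10(points_per_dim)


if __name__ == "__main__":
    xs, probs = discretize(-4.0, 4.0, 50)
    print(sample(xs, probs, random.Random(0), 5))
    print("log10 of grid points, 50 per dimension, N = 1000:", grid_points_log10(50, 1000))
    print("log10 of grid points, 2 per dimension, N = 1000: ", grid_points_log10(2, 1000))
```

The normalization step is trivial with 50 points, and hopeless once the number of points is the 1000th power of anything larger than one.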

Systems with $2^{1000}$ states are two a penny.* One example is a collection
of 1000 spins such as a 30 x 30 fragment of an Ising model whose probability
distribution is proportional to

$$P^*(x) = \exp[-\beta E(x)] \qquad (29.13)$$

where $x_n \in \{\pm 1\}$ and

$$E(x) = -\left[\frac{1}{2}\sum_{m,n} J_{mn} x_m x_n + \sum_n H_n x_n\right]. \qquad (29.14)$$

The energy function E(x) is readily evaluated for any x. But if we wish to
evaluate this function at all states x, the computer time required would be
$2^{1000}$ function evaluations.

* Translation for American readers: ‘such systems are a dime a dozen’;
incidentally, this equivalence (10c = 6p) shows that the correct exchange
rate between our currencies is £1.00 = $1.67.

The Ising model is a simple model which has been around for a long time, 
but the task of generating samples from the distribution P(x) = P*(x)/Z is 
still an active research area; the first ‘exact’ samples from this distribution 
were created in the pioneering work of Propp and Wilson (1996), as we’ll 
describe in Chapter 32. 
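
To illustrate that the energy of any single state is cheap to evaluate, here is a minimal sketch for a rectangular grid of spins. It assumes the usual Ising form: a uniform coupling between nearest neighbours with free (non-periodic) boundaries, each neighbouring pair counted once, and a uniform applied field. These choices, and the function names, are assumptions made for the example.

```python
import math


def ising_energy(spins, coupling=1.0, field=0.0):
    """Energy of a grid of +1/-1 spins.

    Assumed form: E = -coupling * (sum over neighbouring pairs of x_m x_n)
                      - field * (sum over spins of x_n).
    """
    rows, cols = len(spins), len(spins[0])
    energy = 0.0
    for i in range(rows):
        for j in range(cols):
            s = spins[i][j]
            if i + 1 < rows:  # bond to the spin below
                energy -= coupling * s * spins[i + 1][j]
            if j + 1 < cols:  # bond to the spin on the right
                energy -= coupling * s * spins[i][j + 1]
            energy -= field * s
    return energy


def p_star(spins, beta, coupling=1.0, field=0.0):
    """Unnormalized probability exp(-beta * E(x))."""
    return math.exp(-beta * ising_energy(spins, coupling, field))


if __name__ == "__main__":
    aligned = [[1, 1], [1, 1]]
    checkerboard = [[1, -1], [-1, 1]]
    print(ising_energy(aligned), ising_energy(checkerboard))
    print("number of states of a 30 x 30 grid:", 2 ** 900)
```

One evaluation costs time proportional to the number of spins; it is the number of states, not the cost per state, that makes enumeration impossible.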





A useful analogy

Imagine the tasks of drawing random water samples from a lake and finding
the average plankton concentration (figure 29.2). The depth of the lake at
x = (x, y) is P*(x), and we assert (in order to make the analogy work) that
the plankton concentration is a function of x, $\phi(x)$. The required average
concentration is an integral like (29.3), namely

$$\Phi = \langle \phi(x) \rangle \equiv \frac{1}{Z} \int dx\; P^*(x)\,\phi(x), \qquad (29.15)$$

Figure 29.2. A lake whose depth at x = (x, y) is P*(x).
where $Z = \int dx\,dy\; P^*(x)$ is the volume of the lake. You are provided with a
boat, a satellite navigation system, and a plumbline. Using the navigator, you
can take your boat to any desired location x on the map; using the plumbline
you can measure P*(x) at that point. You can also measure the plankton
concentration there.

Problem 1 is to draw 1 cm³ water samples at random from the lake, in
such a way that each sample is equally likely to come from any point within
the lake. Problem 2 is to find the average plankton concentration.

These are difficult problems to solve because at the outset we know nothing
about the depth P*(x). Perhaps much of the volume of the lake is contained
in narrow, deep underwater canyons (figure 29.3), in which case, to correctly
sample from the lake and correctly estimate $\Phi$ our method must implicitly
discover the canyons and find their volume relative to the rest of the lake.
Difficult problems, yes; nevertheless, we’ll see that clever Monte Carlo methods
can solve them.

Figure 29.3. A slice through a lake that includes some canyons.


Uniform sampling 


Having accepted that we cannot exhaustively visit every location x in the
state space, we might consider trying to solve the second problem (estimating
the expectation of a function $\phi(x)$) by drawing random samples $\{x^{(r)}\}_{r=1}^{R}$
uniformly from the state space and evaluating P*(x) at those points. Then
we could introduce a normalizing constant $Z_R$, defined by

$$Z_R = \sum_{r=1}^{R} P^*(x^{(r)}), \qquad (29.16)$$

and estimate $\Phi = \int d^N x\; \phi(x) P(x)$ by

$$\hat{\Phi} = \sum_{r=1}^{R} \phi(x^{(r)})\, \frac{P^*(x^{(r)})}{Z_R}. \qquad (29.17)$$

Is anything wrong with this strategy? Well, it depends on the functions $\phi(x)$
and P*(x). Let us assume that $\phi(x)$ is a benign, smoothly varying function
and concentrate on the nature of P*(x). As we learnt in Chapter 4, a
high-dimensional distribution is often concentrated in a small region of the state
space known as its typical set T, whose volume is given by $|T| \simeq 2^{H(X)}$, where
H(X) is the entropy of the probability distribution P(x). If almost all the
probability mass is located in the typical set and $\phi(x)$ is a benign function,
the value of $\Phi = \int d^N x\; \phi(x) P(x)$ will be principally determined by the values
that $\phi(x)$ takes on in the typical set. So uniform sampling will only stand
a chance of giving a good estimate of $\Phi$ if we make the number of samples
R sufficiently large that we are likely to hit the typical set at least once or
twice. So, how many samples are required? Let us take the case of the Ising
model again. (Strictly, the Ising model may not be a good example, since it
doesn’t necessarily have a typical set, as defined in Chapter 4; the definition
of a typical set was that all states had log probability close to the entropy,
which for an Ising model would mean that the energy is very close to the
mean energy; but in the vicinity of phase transitions, the variance of energy,
also known as the heat capacity, may diverge, which means that the energy
of a random state is not necessarily expected to be very close to the mean
energy.) The total size of the state space is $2^N$ states, and the typical set has
size $2^H$. So each sample has a chance of $2^H/2^N$ of falling in the typical set.
Figure 29.4. (a) Entropy of a 64-spin Ising model as a function
of temperature. (b) One state of a 1024-spin Ising model.


The number of samples required to hit the typical set once is thus of order

$$R_{\min} \simeq 2^{N-H}. \qquad (29.18)$$

So, what is H? At high temperatures, the probability distribution of an Ising
model tends to a uniform distribution and the entropy tends to $H_{\max} = N$
bits, which means $R_{\min}$ is of order 1. Under these conditions, uniform sampling
may well be a satisfactory technique for estimating $\Phi$. But high temperatures
are not of great interest. Considerably more interesting are intermediate
temperatures such as the critical temperature at which the Ising model melts from
an ordered phase to a disordered phase. The critical temperature of an infinite
Ising model, at which it melts, is $\theta_c = 2.27$. At this temperature the entropy
of an Ising model is roughly N/2 bits (figure 29.4). For this probability
distribution the number of samples required simply to hit the typical set once is
of order

$$R_{\min} \simeq 2^{N - N/2} = 2^{N/2}, \qquad (29.19)$$

which for N = 1000 is about $10^{150}$. This is roughly the square of the number
of particles in the universe. Thus uniform sampling is utterly useless for the
study of Ising models of modest size. And in most high-dimensional problems,
if the distribution P(x) is not actually uniform, uniform sampling is unlikely
to be useful.
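
In one dimension, where the typical set is easy to hit, the uniform-sampling estimator does work, and a sketch shows its mechanics. The example below assumes the toy density with unnormalized form exp[0.4(x - 0.4)^2 - 0.08x^4], truncated to the interval from -4 to 4 (an assumption; the density is negligible outside it), and estimates the mean of x. A second helper reproduces the sample-count arithmetic for the Ising case. Function names are illustrative.

```python
import math
import random


def p_star(x):
    """Unnormalized one-dimensional toy density."""
    return math.exp(0.4 * (x - 0.4) ** 2 - 0.08 * x ** 4)


def uniform_sampling_estimate(phi, lo, hi, n_samples, rng):
    """Estimate the mean of phi under P using points drawn uniformly from [lo, hi].

    Each point is weighted by P*(x) and the weights are normalized by
    their sum Z_R.
    """
    xs = [rng.uniform(lo, hi) for _ in range(n_samples)]
    weights = [p_star(x) for x in xs]
    z_r = sum(weights)
    return sum(phi(x) * w for x, w in zip(xs, weights)) / z_r


def log10_samples_needed(n_spins, entropy_bits):
    """log10 of 2^(N - H), the order of samples needed to hit the typical set once."""
    return (n_spins - entropy_bits) * math.log10(2.0)


if __name__ == "__main__":
    rng = random.Random(1)
    print("estimated mean of x:", uniform_sampling_estimate(lambda x: x, -4.0, 4.0, 20000, rng))
    print("log10 R_min for N = 1000, H = 500:", log10_samples_needed(1000, 500))
```

The same estimator applied to 1000 spins at the critical temperature would need of order 10^150 samples before a single one landed in the typical set.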


Overview 


Having established that drawing samples from a high-dimensional distribution 
P(x) = P*(x)/Z is difficult even if P*(x) is easy to evaluate, we will now 
study a sequence of more sophisticated Monte Carlo methods: importance 
sampling, rejection sampling, the Metropolis method, Gibbs sampling, and 
slice sampling. 


▶ 29.2 Importance sampling


Importance sampling is not a method for generating samples from P(x)
(problem 1); it is just a method for estimating the expectation of a function $\phi(x)$
(problem 2). It can be viewed as a generalization of the uniform sampling
method.

For illustrative purposes, let us imagine that the target distribution is a 
one-dimensional density P(x). Let us assume that we are able to evaluate this 
density at any chosen point x, at least to within a multiplicative constant; 
thus we can evaluate a function P*(x) such that 


P(x) = P*(x)/Z. (29.20) 
But P(x) is too complicated a function for us to be able to sample from it
directly. We now assume that we have a simpler density Q(x) from which we
can generate samples and which we can evaluate to within a multiplicative
constant (that is, we can evaluate Q*(x), where $Q(x) = Q^*(x)/Z_Q$). An
example of the functions P*, Q* and $\phi$ is shown in figure 29.5. We call Q the
sampler density.

In importance sampling, we generate R samples $\{x^{(r)}\}_{r=1}^{R}$ from Q(x). If
these points were samples from P(x) then we could estimate $\Phi$ by
equation (29.6). But when we generate samples from Q, values of x where Q(x) is
greater than P(x) will be over-represented in this estimator, and points where
Q(x) is less than P(x) will be under-represented. To take into account the
fact that we have sampled from the wrong distribution, we introduce weights

$$w_r \equiv \frac{P^*(x^{(r)})}{Q^*(x^{(r)})} \qquad (29.21)$$

which we use to adjust the ‘importance’ of each point in our estimator thus:

$$\hat{\Phi} \equiv \frac{\sum_r w_r\, \phi(x^{(r)})}{\sum_r w_r}. \qquad (29.22)$$

Exercise 29.1.[p.384] Prove that, if Q(x) is non-zero for all x where P(x) is
non-zero, the estimator $\hat{\Phi}$ converges to $\Phi$, the mean value of $\phi(x)$, as R
increases. What is the variance of this estimator, asymptotically? Hint:
consider the statistics of the numerator and the denominator separately.
Is the estimator $\hat{\Phi}$ an unbiased estimator for small R?

A practical difficulty with importance sampling is that it is hard to estimate
how reliable the estimator Φ̂ is. The variance of the estimator is unknown
beforehand, because it depends on an integral over x of a function involving
P*(x). And the variance of Φ̂ is hard to estimate, because the empirical
variances of the quantities w_r and w_r φ(x^(r)) are not necessarily a good
guide to the true variances of the numerator and denominator in equation
(29.22). If the proposal density Q(x) is small in a region where |φ(x)P*(x)|
is large then it is quite possible, even after many points x^(r) have been
generated, that none of them will have fallen in that region. In this case the
estimate of Φ would be drastically wrong, and there would be no indication in
the empirical variance that the true variance of the estimator Φ̂ is large.
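The weighted estimator of equations (29.21) and (29.22) can be sketched in a few lines of Python. This is a minimal illustration, not code from the text: the toy target (an unnormalized Gaussian with mean 1), the broader Gaussian sampler, and the names `importance_sampling_estimate`, `p_star`, `q_star` and `sample_q` are all assumptions chosen for the example.

```python
import math
import random

def importance_sampling_estimate(phi, p_star, q_star, sample_q, R, rng):
    """Estimate the mean of phi under P(x) = P*(x)/Z using R samples from Q."""
    numerator = 0.0
    denominator = 0.0
    for _ in range(R):
        x = sample_q(rng)              # x^(r) drawn from the sampler density Q
        w = p_star(x) / q_star(x)      # weight w_r, equation (29.21)
        numerator += w * phi(x)
        denominator += w
    return numerator / denominator     # estimator, equation (29.22)

# Hypothetical toy problem (an assumption, not taken from the text):
# P* is an unnormalized Gaussian with mean 1 and standard deviation 1;
# Q is a zero-mean Gaussian with standard deviation 3, evaluated up to a constant.
p_star = lambda x: math.exp(-0.5 * (x - 1.0) ** 2)
q_star = lambda x: math.exp(-0.5 * (x / 3.0) ** 2)
sample_q = lambda rng: rng.gauss(0.0, 3.0)

rng = random.Random(0)
estimate = importance_sampling_estimate(lambda x: x, p_star, q_star,
                                        sample_q, 20000, rng)
print(estimate)   # should lie close to the true mean of the toy target, 1.0
```

Neither normalizing constant is needed, because Z and Z_Q cancel in the ratio. The sampler here is deliberately broader than the target; a narrower sampler would be expected to show the unreliability described above.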


































Cautionary illustration of importance sampling 


In a toy problem related to the modelling of amino acid probability
distributions with a one-dimensional variable x, I evaluated a quantity of
interest using importance sampling. The results using a Gaussian sampler and a
Cauchy sampler are shown in figure 29.6. The horizontal axis shows the number
of samples on a log scale. In the case of the Gaussian sampler, after about 500
samples had been evaluated one might be tempted to call a halt; but evidently
there are infrequent samples that make a huge contribution to Φ̂, and the value
of the estimate at 500 samples is wrong. Even after a million samples have
been taken, the estimate has still not settled down close to the true value. In
contrast, the Cauchy sampler does not suffer from glitches; it converges (on
the scale shown here) after about 5000 samples.

This example illustrates the fact that an importance sampler should have
heavy tails.

Figure 29.5. Functions involved in importance sampling. We wish to estimate
the expectation of φ(x) under P(x) ∝ P*(x). We can generate samples from the
simpler distribution Q(x) ∝ Q*(x). We can evaluate Q* and P* at any point.

Figure 29.6. Importance sampling in action: (a) using a Gaussian sampler
density; (b) using a Cauchy sampler density. Vertical axis shows the estimate
Φ̂. The horizontal line indicates the true value of Φ. Horizontal axis shows
number of samples on a log scale.
Exercise 29.2. [p.385] Consider the situation where P*(x) is multimodal,
consisting of several widely-separated peaks. (Probability distributions like
this arise frequently in statistical data modelling.) Discuss whether it is
a wise strategy to do importance sampling using a sampler Q(x) that
is a unimodal distribution fitted to one of these peaks. Assume that
the function φ(x) whose mean Φ is to be estimated is a smoothly varying
function of x such as mx + c. Describe the typical evolution of the
estimator Φ̂ as a function of the number of samples R.

Figure 29.7. A multimodal distribution P*(x) and a unimodal sampler Q(x).


Importance sampling in many dimensions 


We have already observed that care is needed in one-dimensional importance 
sampling problems. Is importance sampling a useful technique in spaces of 
higher dimensionality, say N = 1000? 

Consider a simple case-study where the target density P(x) is a uniform
distribution inside a sphere,

\[ P^*(x) = \begin{cases} 1 & 0 \le \rho(x) \le R_P \\ 0 & \rho(x) > R_P, \end{cases} \tag{29.23} \]

where ρ(x) ≡ (Σ_i x_i²)^(1/2), and the proposal density is a Gaussian centred
on the origin,

\[ Q(x) = \prod_i \text{Normal}(x_i; 0, \sigma^2). \tag{29.24} \]


An importance-sampling method will be in trouble if the estimator Φ̂ is
dominated by a few large weights w_r. What will be the typical range of values
of the weights w_r? We know from our discussions of typical sequences in Part I
(see exercise 6.14 (p.124), for example) that if ρ is the distance from the
origin of a sample from Q, the quantity ρ² has a roughly Gaussian distribution
with mean and standard deviation:

\[ \rho^2 \sim N\sigma^2 \pm \sqrt{2N}\,\sigma^2. \tag{29.25} \]





Thus almost all samples from Q lie in a typical set with distance from the origin
very close to √N σ. Let us assume that σ is chosen such that the typical set
of Q lies inside the sphere of radius R_P. [If it does not, then the law of large
numbers implies that almost all the samples generated from Q will fall outside
R_P and will have weight zero.] Then we know that most samples from Q will
have a value of Q that lies in the range

\[ \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{N}{2} \pm \frac{\sqrt{2N}}{2} \right). \tag{29.26} \]

Thus the weights w_r = P*/Q will typically have values in the range

\[ (2\pi\sigma^2)^{N/2} \exp\left( \frac{N}{2} \pm \frac{\sqrt{2N}}{2} \right). \tag{29.27} \]





Figure 29.8. Rejection sampling. (a) The functions involved in rejection
sampling. We desire samples from P(x) ∝ P*(x). We are able to draw samples
from Q(x) ∝ Q*(x), and we know a value c such that c Q*(x) > P*(x) for all x.
(b) A point (x, u) is generated at random in the lightly shaded area under the
curve c Q*(x). If this point also lies below P*(x) then it is accepted.

So if we draw a hundred samples, what will the typical range of weights be?
We can roughly estimate the ratio of the largest weight to the median weight
by doubling the standard deviation in equation (29.27). The largest weight
and the median weight will typically be in the ratio:

\[ \frac{w_r^{\max}}{w_r^{\text{med}}} = \exp\left( \sqrt{2N} \right). \tag{29.28} \]

In N = 1000 dimensions therefore, the largest weight after one hundred
samples is likely to be roughly 10^19 times greater than the median weight.
Thus an importance sampling estimate for a high-dimensional problem will very
likely be utterly dominated by a few samples with huge weights.

In conclusion, importance sampling in high dimensions often suffers from
two difficulties. First, we need to obtain samples that lie in the typical set of P,
and this may take a long time unless Q is a good approximation to P. Second,
even if we obtain samples in the typical set, the weights associated with those
samples are likely to vary by large factors, because the probabilities of points
in a typical set, although similar to each other, still differ by factors of order
exp(√N), so the weights will too, unless Q is a near-perfect approximation to
P.
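The spread of weights in the sphere case-study can be checked by simulation. The sketch below is an illustration under stated assumptions: it takes every sample to fall inside the sphere (so P* = 1), uses σ = 1, and the function name `log10_weight_ratio` is made up for the example. It reports the ratio of the largest to the median weight in decades, alongside the rough estimate exp(√(2N)) from equation (29.28).

```python
import math
import random
import statistics

def log10_weight_ratio(N, num_samples, sigma, rng):
    """Return log10(w_max / w_median) for samples drawn from Q = Normal(0, sigma^2 I).

    Assumes every sample lies inside the sphere, so P* = 1 and
    ln w = -ln Q(x) = (N/2) ln(2 pi sigma^2) + rho^2 / (2 sigma^2).
    The constant term cancels in the ratio, so only rho^2 is needed.
    """
    log_w = []
    for _ in range(num_samples):
        rho_sq = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(N))
        log_w.append(rho_sq / (2.0 * sigma ** 2))
    return (max(log_w) - statistics.median(log_w)) / math.log(10.0)

rng = random.Random(1)
ratio_1000 = log10_weight_ratio(N=1000, num_samples=100, sigma=1.0, rng=rng)
ratio_10 = log10_weight_ratio(N=10, num_samples=100, sigma=1.0, rng=rng)
predicted_1000 = math.sqrt(2 * 1000) / math.log(10.0)   # exp(sqrt(2N)) in decades

print(ratio_1000, predicted_1000, ratio_10)
```

The estimate in the text is only a rough one, so the simulated and predicted exponents for N = 1000 should be expected to agree in order of magnitude rather than exactly; the point is that both are enormous compared with the N = 10 case.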


»> 29.3 Rejection sampling 


We assume again a one-dimensional density P(x) = P*(x)/Z that is too
complicated a function for us to be able to sample from it directly. We assume
that we have a simpler proposal density Q(x) which we can evaluate (within a
multiplicative factor Z_Q, as before), and from which we can generate samples.
We further assume that we know the value of a constant c such that

c Q*(x) > P*(x), for all x. (29.29)


A schematic picture of the two functions is shown in figure 29.8a. 

We generate two random numbers. The first, x, is generated from the
proposal density Q(x). We then evaluate c Q*(x) and generate a uniformly
distributed random variable u from the interval [0, c Q*(x)]. These two random
numbers can be viewed as selecting a point in the two-dimensional plane as
shown in figure 29.8b.

We now evaluate P*(x) and accept or reject the sample x by comparing the
value of u with the value of P*(x). If u > P*(x) then x is rejected; otherwise
it is accepted, which means that we add x to our set of samples {x^(r)}. The
value of u is discarded.

Why does this procedure generate samples from P(x)? The proposed point
(x, u) comes with uniform probability from the lightly shaded area underneath
the curve c Q*(x) as shown in figure 29.8b. The rejection rule rejects all the
points that lie above the curve P*(x). So the points (x, u) that are accepted
are uniformly distributed in the heavily shaded area under P*(x). This implies
that the probability density of the x-coordinates of the accepted points must
be proportional to P*(x), so the samples must be independent samples from
P(x).

Rejection sampling will work best if Q is a good approximation to P. If Q 
is very different from P then, for cQ to exceed P everywhere, c will necessarily 
have to be large and the frequency of rejection will be large. 
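The accept/reject procedure just described can be sketched as follows. This is a minimal illustration under assumptions of my own: the target P* is an unnormalized unit-variance Gaussian, the proposal Q is a Gaussian of standard deviation 1.5, and c = 1.1 is a value for which c Q*(x) > P*(x) holds for this particular pair. The names `rejection_sample`, `p_star`, `q_star` and `sample_q` are hypothetical.

```python
import math
import random

def rejection_sample(p_star, q_star, sample_q, c, num_samples, rng):
    """Draw independent samples from P(x) = P*(x)/Z, given c with c Q*(x) > P*(x)."""
    accepted = []
    num_proposals = 0
    while len(accepted) < num_samples:
        x = sample_q(rng)                      # x generated from Q
        u = rng.uniform(0.0, c * q_star(x))    # u uniform on [0, c Q*(x)]
        num_proposals += 1
        if u <= p_star(x):                     # x is rejected if u > P*(x)
            accepted.append(x)                 # u itself is discarded
    return accepted, num_proposals

# Hypothetical toy functions (assumptions, not from the text).
p_star = lambda x: math.exp(-0.5 * x * x)
q_star = lambda x: math.exp(-0.5 * (x / 1.5) ** 2)
sample_q = lambda rng: rng.gauss(0.0, 1.5)

rng = random.Random(0)
samples, num_proposals = rejection_sample(p_star, q_star, sample_q,
                                          c=1.1, num_samples=5000, rng=rng)
acceptance_rate = len(samples) / num_proposals
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(acceptance_rate, sample_mean, sample_var)
```

For this toy pair the acceptance rate should be near Z/(c Z_Q) = 1/(1.1 × 1.5), and the accepted samples should have mean near 0 and variance near 1.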











Rejection sampling in many dimensions 








In a high-dimensional problem it is very likely that the requirement that c Q*
be an upper bound for P* will force c to be so huge that acceptances will be
very rare indeed. Finding such a value of c may be difficult too, since in many
problems we know neither where the modes of P* are located nor how high
they are.

Figure 29.9. A Gaussian P(x) and a slightly broader Gaussian Q(x) scaled up
by a factor c such that c Q(x) > P(x).

As a case study, consider a pair of N-dimensional Gaussian distributions
with mean zero (figure 29.9). Imagine generating samples from one with
standard deviation σ_Q and using rejection sampling to obtain samples from the
other whose standard deviation is σ_P. Let us assume that these two standard
deviations are close in value (say, σ_Q is 1% larger than σ_P). [σ_Q must be
larger than σ_P because if this is not the case, there is no c such that c Q
exceeds P for all x.] So, what value of c is required if the dimensionality is
N = 1000? The density of Q(x) at the origin is 1/(2πσ_Q²)^(N/2), so for c Q to
exceed P we need to set

\[ c = \frac{(2\pi\sigma_Q^2)^{N/2}}{(2\pi\sigma_P^2)^{N/2}} = \exp\left( N \ln \frac{\sigma_Q}{\sigma_P} \right). \tag{29.30} \]

With N = 1000 and σ_Q/σ_P = 1.01, we find c = exp(10) ≃ 20,000. What will the
acceptance rate be for this value of c? The answer is immediate: since the
acceptance rate is the ratio of the volume under the curve P(x) to the volume
under c Q(x), the fact that P and Q are both normalized here implies that
the acceptance rate will be 1/c, for example, 1/20,000. In general, c grows
exponentially with the dimensionality N, so the acceptance rate is expected
to be exponentially small in N.

Rejection sampling, therefore, whilst a useful method for one-dimensional
problems, is not expected to be a practical technique for generating samples
from high-dimensional distributions P(x).
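The exponential growth of c with N in equation (29.30) is easy to tabulate. The short sketch below evaluates c and the acceptance rate 1/c for a 1% difference in standard deviations; the function name `rejection_constant` and the particular list of dimensionalities are choices made for this illustration.

```python
import math

def rejection_constant(N, sigma_ratio):
    """c = exp(N ln(sigma_Q / sigma_P)) for two zero-mean N-dimensional Gaussians."""
    return math.exp(N * math.log(sigma_ratio))

for N in (1, 10, 100, 1000):
    c = rejection_constant(N, 1.01)
    print(N, c, 1.0 / c)   # dimensionality, required c, acceptance rate 1/c
```

For N = 1000 this gives a value of c a little above 20,000, in line with the figure quoted in the text.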





»> 29.4 The Metropolis—Hastings method 


Importance sampling and rejection sampling work well only if the proposal
density Q(x) is similar to P(x). In large and complex problems it is difficult
to create a single density Q(x) that has this property.

The Metropolis-Hastings algorithm instead makes use of a proposal density Q
which depends on the current state x^(t). The density Q(x'; x^(t)) might
be a simple distribution such as a Gaussian centred on the current x^(t). The
proposal density Q(x'; x) can be any fixed density from which we can draw
samples. In contrast to importance sampling and rejection sampling, it is not
necessary that Q(x'; x^(t)) look at all similar to P(x) in order for the
algorithm to be practically useful. An example of a proposal density is shown
in figure 29.10; this figure shows the density Q(x'; x^(t)) for two different
states x^(1) and x^(2).

Figure 29.10. Metropolis-Hastings method in one dimension. The proposal
distribution Q(x'; x) is here shown as having a shape that changes as x
changes, though this is not typical of the proposal densities used in practice.

As before, we assume that we can evaluate P*(x) for any x. A tentative
new state x' is generated from the proposal density Q(x'; x^(t)). To decide
whether to accept the new state, we compute the quantity

\[ a = \frac{P^*(x')}{P^*(x^{(t)})} \, \frac{Q(x^{(t)}; x')}{Q(x'; x^{(t)})}. \tag{29.31} \]

If a ≥ 1 then the new state is accepted.
Otherwise, the new state is accepted with probability a.

If the step is accepted, we set x^(t+1) = x'.
If the step is rejected, then we set x^(t+1) = x^(t).


Note the difference from rejection sampling: in rejection sampling, rejected
points are discarded and have no influence on the list of samples {x^(r)} that
we collected. Here, a rejection causes the current state to be written again
onto the list.

Notation. I have used the superscript r = 1,...,R to label points that 
are independent samples from a distribution, and the superscript t= 1,...,T 
to label the sequence of states in a Markov chain. It is important to note that 
a Metropolis—Hastings simulation of T iterations does not produce T indepen- 
dent samples from the target distribution P. The samples are dependent. 

To compute the acceptance probability (29.31) we need to be able to compute
the probability ratios P(x')/P(x^(t)) and Q(x^(t); x')/Q(x'; x^(t)). If the
proposal density is a simple symmetrical density such as a Gaussian centred on
the current point, then the latter factor is unity, and the Metropolis-Hastings
method simply involves comparing the value of the target density at the two
points. This special case is sometimes called the Metropolis method. However,
with apologies to Hastings, I will call the general Metropolis-Hastings
algorithm for asymmetric Q 'the Metropolis method' since I believe important
ideas deserve short names.
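The accept/reject rule of equation (29.31) can be sketched as below. This is an illustrative implementation under assumptions of my own: the target is an unnormalized Gaussian with mean 2, the proposal is a unit-width Gaussian centred on the current state, and the names `metropolis_hastings`, `propose` and `q_density` are hypothetical. The proposal density is passed in explicitly so that the same function would also cover an asymmetric Q.

```python
import math
import random

def metropolis_hastings(p_star, propose, q_density, x0, T, rng):
    """Run T iterations; q_density(x_new, x_old) evaluates Q(x_new; x_old)."""
    x = x0
    chain = []
    for _ in range(T):
        x_new = propose(x, rng)
        # Acceptance quantity a of equation (29.31).
        a = (p_star(x_new) * q_density(x, x_new)) / (p_star(x) * q_density(x_new, x))
        if a >= 1.0 or rng.random() < a:
            x = x_new
        chain.append(x)   # on rejection the current state is written again
    return chain

# Hypothetical toy problem (an assumption): target mean 2, standard deviation 1.
p_star = lambda x: math.exp(-0.5 * (x - 2.0) ** 2)
propose = lambda x, rng: rng.gauss(x, 1.0)
q_density = lambda x_new, x_old: math.exp(-0.5 * (x_new - x_old) ** 2)

rng = random.Random(0)
chain = metropolis_hastings(p_star, propose, q_density, x0=0.0, T=20000, rng=rng)
kept = chain[1000:]                      # discard an initial stretch of the chain
chain_mean = sum(kept) / len(kept)
print(chain_mean)                        # should be near the toy target mean, 2.0
```

Because this proposal is symmetric, the ratio of Q values is unity and could be dropped, which is the special case described above. Successive entries of `chain` are dependent, so the 19,000 retained states are worth far fewer independent samples.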


Convergence of the Metropolis method to the target density 


It can be shown that for any positive Q (that is, any Q such that Q(x'; x) > 0
for all x, x'), as t → ∞, the probability distribution of x^(t) tends to
P(x) = P*(x)/Z. [This statement should not be seen as implying that Q has to
assign positive probability to every point x'; we will discuss examples later
where Q(x'; x) = 0 for some x, x'. Notice also that we have said nothing about
how rapidly the convergence to P(x) takes place.]

The Metropolis method is an example of a Markov chain Monte Carlo
method (abbreviated MCMC). In contrast to rejection sampling, where the
accepted points {x^(r)} are independent samples from the desired distribution,
Markov chain Monte Carlo methods involve a Markov process in which a
sequence of states {x^(t)} is generated, each sample x^(t) having a probability
distribution that depends on the previous value, x^(t-1). Since successive
samples are dependent, the Markov chain may have to be run for a considerable
time in order to generate samples that are effectively independent samples
from P.

Just as it was difficult to estimate the variance of an importance sampling 
estimator, so it is difficult to assess whether a Markov chain Monte Carlo 
method has ‘converged’, and to quantify how long one has to wait to obtain 
samples that are effectively independent samples from P. 


Demonstration of the Metropolis method 


The Metropolis method is widely used for high-dimensional problems. Many
implementations of the Metropolis method employ a proposal distribution
with a length scale ε that is short relative to the longest length scale L of the
probable region (figure 29.11). A reason for choosing a small length scale is
that for most high-dimensional problems, a large random step from a typical
point (that is, a sample from P(x)) is very likely to end in a state that has
very low probability; such steps are unlikely to be accepted. If ε is large,
movement around the state space will only occur when such a transition to a
low-probability state is actually accepted, or when a large random step chances
to land in another probable state. So the rate of progress will be slow if large
steps are used.

Figure 29.11. Metropolis method in two dimensions, showing a traditional
proposal density that has a sufficiently small step size ε that the acceptance
frequency will be about 0.5.

The disadvantage of small steps, on the other hand, is that the Metropolis 
method will explore the probability distribution by a random walk, and a 
random walk takes a long time to get anywhere, especially if the walk is made 
of small steps. 


Exercise 29.3. Consider a one-dimensional random walk, on each step of
which the state moves randomly to the left or to the right with equal
probability. Show that after T steps of size ε, the state is likely to have
moved only a distance about √T ε. (Compute the root mean square
distance travelled.)


Recall that the first aim of Monte Carlo sampling is to generate a number of
independent samples from the given distribution (a dozen, say). If the largest
length scale of the state space is L, then we have to simulate a random-walk
Metropolis method for a time T ≃ (L/ε)² before we can expect to get a sample
that is roughly independent of the initial condition, and that's assuming that
every step is accepted: if only a fraction f of the steps are accepted on average,
then this time is increased by a factor 1/f.


Rule of thumb: lower bound on number of iterations of a
Metropolis method. If the largest length scale of the space of
probable states is L, a Metropolis method whose proposal distribution
generates a random walk with step size ε must be run for at
least

\[ T \simeq (L/\epsilon)^2 \tag{29.32} \]

iterations to obtain an independent sample.





This rule of thumb gives only a lower bound; the situation may be much
worse, if, for example, the probability distribution consists of several islands
of high probability separated by regions of low probability.

Figure 29.12. Metropolis method

To illustrate how slowly a random walk explores a state space, figure 29.12
shows a simulation of a Metropolis algorithm for generating samples from the
distribution:

\[ P(x) = \begin{cases} 1/21 & x \in \{0, 1, 2, \ldots, 20\} \\ 0 & \text{otherwise.} \end{cases} \tag{29.33} \]

The proposal distribution is

\[ Q(x'; x) = \begin{cases} 1/2 & x' = x \pm 1 \\ 0 & \text{otherwise.} \end{cases} \tag{29.34} \]


Because the target distribution P(x) is uniform, rejections occur only when
the proposal takes the state to x' = −1 or x' = 21.

The simulation was started in the state x_0 = 10 and its evolution is shown
in figure 29.12a. How long does it take to reach one of the end states x = 0
and x = 20? Since the distance is 10 steps, the rule of thumb (29.32) predicts 
that it will typically take a time T ~ 100 iterations to reach an end state. This 
is confirmed in the present example: the first step into an end state occurs on 
the 178th iteration. How long does it take to visit both end states? The rule 
of thumb predicts about 400 iterations are required to traverse the whole state 
space; and indeed the first encounter with the other end state takes place on 
the 540th iteration. Thus effectively-independent samples are generated only 
by simulating for about four hundred iterations per independent sample. 
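The toy chain of equations (29.33) and (29.34) is simple enough to simulate directly. The sketch below is an illustration, not the simulation behind figure 29.12: it averages over many independent runs (the run count, the seed and the name `first_hit_time` are assumptions) to estimate how long the chain takes to reach an end state from x = 10, and how long it takes to travel from one end to the other.

```python
import random

def first_hit_time(x0, targets, rng, max_iters=10**6):
    """Iterations until the random-walk Metropolis chain on {0,...,20} enters `targets`."""
    x = x0
    for t in range(1, max_iters + 1):
        x_new = x + (1 if rng.random() < 0.5 else -1)   # proposal (29.34)
        if 0 <= x_new <= 20:        # uniform target: accept unless outside {0..20}
            x = x_new
        if x in targets:
            return t
    raise RuntimeError("target states not reached")

rng = random.Random(0)
runs = 1000
mean_to_either_end = sum(first_hit_time(10, {0, 20}, rng) for _ in range(runs)) / runs
mean_end_to_end = sum(first_hit_time(0, {20}, rng) for _ in range(runs)) / runs
print(mean_to_either_end, mean_end_to_end)
```

A short calculation for this particular chain gives expected values of 100 and 420 iterations for these two averages, consistent with the rule-of-thumb figures of about 100 and about 400; individual runs scatter widely around these means, as the 178th and 540th iterations quoted above suggest.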

This simple example shows that it is important to try to abolish random 
walk behaviour in Monte Carlo methods. A systematic exploration of the toy 
state space {0,1,2,...,20} could get around it, using the same step sizes, in 
about twenty steps instead of four hundred. Methods for reducing random 
walk behaviour are discussed in the next chapter. 


Metropolis method in high dimensions 


The rule of thumb (29.32), which gives a lower bound on the number of
iterations of a random walk Metropolis method, also applies to
higher-dimensional problems. Consider the simple case of a target distribution
that is an N-dimensional Gaussian, and a proposal distribution that is a
spherical Gaussian of standard deviation ε in each direction. Without loss of
generality, we can assume that the target distribution is a separable
distribution aligned with the axes {x_n}, and that it has standard deviation
σ_n in direction n. Let σ^max and σ^min be the largest and smallest of these
standard deviations. Let us assume that ε is adjusted such that the acceptance
frequency is close to 1. Under this assumption, each variable x_n evolves
independently of all the others, executing a random walk with step size about
ε. The time taken to generate effectively independent samples from the target
distribution will be controlled by the largest lengthscale σ^max. Just as in
the previous section, where we needed at least T ≃ (L/ε)² iterations to obtain
an independent sample, here we need T ≃ (σ^max/ε)².

Now, how big can ε be? The bigger it is, the smaller this number T becomes,
but if ε is too big (bigger than σ^min) then the acceptance rate will
fall sharply. It seems plausible that the optimal ε must be similar to σ^min.
Strictly, this may not be true; in special cases where the second smallest σ_n
is significantly greater than σ^min, the optimal ε may be closer to that second
smallest σ_n. But our rough conclusion is this: where simple spherical
proposal distributions are used, we will need at least T ≃ (σ^max/σ^min)²
iterations to obtain an independent sample, where σ^max and σ^min are the
longest and shortest lengthscales of the target distribution.

Figure 29.13. Gibbs sampling. (a) The joint density P(x) from which samples
are required. (b) Starting from a state x^(t), x_1 is sampled from the
conditional density P(x_1 | x_2^(t)). (c) A sample is then made from the
conditional density P(x_2 | x_1). (d) A couple of iterations of Gibbs sampling.

This is good news and bad news. It is good news because, unlike the 
cases of rejection sampling and importance sampling, there is no catastrophic 
dependence on the dimensionality N. Our computer will give useful answers 
in a time shorter than the age of the universe. But it is bad news all the same, 
because this quadratic dependence on the lengthscale-ratio may still force us 
to make very lengthy simulations. 

Fortunately, there are methods for suppressing random walks in Monte 
Carlo simulations, which we will discuss in the next chapter. 


»> 29.5 Gibbs sampling 


We introduced importance sampling, rejection sampling and the Metropolis 
method using one-dimensional examples. Gibbs sampling, also known as the 
heat bath method or ‘Glauber dynamics’, is a method for sampling from dis- 
tributions over at least two dimensions. Gibbs sampling can be viewed as a 
Metropolis method in which a sequence of proposal distributions Q are defined 
in terms of the conditional distributions of the joint distribution P(x). It is 
assumed that, whilst P(x) is too complex to draw samples from directly, its 
conditional distributions P(x_i | {x_j}_{j≠i}) are tractable to work with. For many
graphical models (but not all) these one-dimensional conditional distributions 
are straightforward to sample from. For example, if a Gaussian distribution 
for some variables d has an unknown mean m, and the prior distribution of m 
is Gaussian, then the conditional distribution of m given d is also Gaussian. 
Conditional distributions that are not of standard form may still be sampled 
from by adaptive rejection sampling if the conditional distribution satisfies 
certain convexity properties (Gilks and Wild, 1992). 

Gibbs sampling is illustrated for a case with two variables (x_1, x_2) = x
in figure 29.13. On each iteration, we start from the current state x^(t), and
x_1 is sampled from the conditional density P(x_1 | x_2), with x_2 fixed to
x_2^(t). A sample x_2 is then made from the conditional density P(x_2 | x_1),
using the new value of x_1. This brings us to the new state x^(t+1), and
completes the iteration.

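In the general case of a system with K variables, a single iteration involves
sampling one parameter at a time:

\[ x_1^{(t+1)} \sim P(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_K^{(t)}) \tag{29.35} \]
\[ x_2^{(t+1)} \sim P(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_K^{(t)}) \tag{29.36} \]
\[ x_3^{(t+1)} \sim P(x_3 \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_K^{(t)}), \text{ etc.} \tag{29.37} \]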
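The two-variable iteration described above can be sketched for a case where both conditionals are available in closed form. The example below is an assumption made for illustration: the target is a zero-mean bivariate Gaussian with unit variances and correlation `corr`, for which each conditional is a Gaussian with mean `corr` times the other variable and variance 1 − corr². The function name `gibbs_bivariate_gaussian` is hypothetical.

```python
import math
import random

def gibbs_bivariate_gaussian(corr, T, rng, x1=0.0, x2=0.0):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian."""
    cond_std = math.sqrt(1.0 - corr ** 2)
    chain = []
    for _ in range(T):
        x1 = rng.gauss(corr * x2, cond_std)   # x1 ~ P(x1 | x2), x2 at its current value
        x2 = rng.gauss(corr * x1, cond_std)   # x2 ~ P(x2 | x1), using the new x1
        chain.append((x1, x2))                # new state x^(t+1)
    return chain

rng = random.Random(0)
chain = gibbs_bivariate_gaussian(corr=0.8, T=50000, rng=rng)
sample_corr = sum(a * b for a, b in chain) / len(chain)   # estimates E[x1 x2] = corr
print(sample_corr)
```

Every draw is kept, since there is no accept/reject step. With a correlation closer to 1 the conditional width shrinks relative to the marginal width and the chain would be expected to move more slowly.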


Convergence of Gibbs sampling to the target density


▷ Exercise 29.4. Show that a single variable-update of Gibbs sampling can
be viewed as a Metropolis method with target density P(x), and that
this Metropolis method has the property that every proposal is always
accepted.


Because Gibbs sampling is a Metropolis method, the probability distribution
of x^(t) tends to P(x) as t → ∞, as long as P(x) does not have pathological
properties.


▷ Exercise 29.5. [p.385] Discuss whether the syndrome decoding problem for a
(7,4) Hamming code can be solved using Gibbs sampling. The syndrome
decoding problem, if we are to solve it with a Monte Carlo approach,
is to draw samples from the posterior distribution of the noise vector
n = (n_1, ..., n_n, ..., n_N),

\[ P(\mathbf{n} \mid \mathbf{f}, \mathbf{z}) = \frac{1}{Z} \prod_{n=1}^{N} f_n^{n_n} (1 - f_n)^{(1 - n_n)} \, \mathbb{1}[\mathbf{H}\mathbf{n} = \mathbf{z}], \tag{29.38} \]

where f_n is the normalized likelihood for the nth transmitted bit and z
is the observed syndrome. The factor 1[Hn = z] is 1 if n has the correct
syndrome z and 0 otherwise.


What about the syndrome decoding problem for any linear error-correcting 
code? 


Gibbs sampling in high dimensions 


Gibbs sampling suffers from the same defect as simple Metropolis algorithms:
the state space is explored by a slow random walk, unless a fortuitous
parameterization has been chosen that makes the probability distribution P(x)
separable. If, say, two variables x_1 and x_2 are strongly correlated, having
marginal densities of width L and conditional densities of width ε, then it will
take at least about (L/ε)² iterations to generate an independent sample from
the target density. Figure 30.3, p.390, illustrates the slow progress made by
Gibbs sampling when L ≫ ε.

However, Gibbs sampling involves no adjustable parameters, so it is an
attractive strategy when one wants to get a model running quickly. An excellent
software package, BUGS, makes it easy to set up almost arbitrary probabilistic
models and simulate them by Gibbs sampling (Thomas et al., 1992).¹


¹ http://www.mrc-bsu.cam.ac.uk/bugs/


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 


You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.




»> 29.6 Terminology for Markov chain Monte Carlo methods 


We now spend a few moments sketching the theory on which the Metropolis
method and Gibbs sampling are based. We denote by p^(t)(x) the probability
distribution of the state of a Markov chain simulator. (To visualize this
distribution, imagine running an infinite collection of identical simulators in
parallel.) Our aim is to find a Markov chain such that as t → ∞, p^(t)(x) tends
to the desired distribution P(x).

A Markov chain can be specified by an initial probability distribution
p^(0)(x) and a transition probability T(x'; x).

The probability distribution of the state at the (t+1)th iteration of the
Markov chain, p^(t+1)(x), is given by

\[ p^{(t+1)}(x') = \int d^N x \; T(x'; x) \, p^{(t)}(x). \tag{29.39} \]

Example 29.6. An example of a Markov chain is given by the Metropolis
demonstration of section 29.4 (figure 29.12), for which the transition
probability is

\[ T = \begin{bmatrix}
1/2 & 1/2 &        &        &     &     \\
1/2 & 0   & 1/2    &        &     &     \\
    & 1/2 & 0      & 1/2    &     &     \\
    &     & \ddots & \ddots & \ddots &  \\
    &     &        & 1/2    & 0   & 1/2 \\
    &     &        &        & 1/2 & 1/2
\end{bmatrix} \]

and the initial distribution was

\[ p^{(0)}(x) = [\, 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 1 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \; 0 \,]. \tag{29.40} \]

The probability distribution p^(t)(x) of the state at the tth iteration is shown
for t = 0, 1, 2, 3, 5, 10, 100, 200, 400 in figure 29.14; an equivalent sequence of
distributions is shown in figure 29.15 for the chain that begins in initial state
x_0 = 17. Both chains converge to the target density, the uniform density, as
t → ∞.
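The evolution of p^(t)(x) in example 29.6 can be reproduced by applying equation (29.39) repeatedly, with the integral replaced by a sum over the 21 states. The sketch below builds the transition matrix from the proposal and rejection rule of section 29.4; the function names `transition_matrix` and `step` and the list-of-lists representation are choices made for this illustration.

```python
def transition_matrix(num_states=21):
    """T[x_new][x_old]: propose x_old +/- 1 with probability 1/2 each, reject moves outside."""
    T = [[0.0] * num_states for _ in range(num_states)]
    for x in range(num_states):
        for x_new in (x - 1, x + 1):
            if 0 <= x_new < num_states:
                T[x_new][x] += 0.5
            else:
                T[x][x] += 0.5          # rejected proposal: the state stays put
    return T

def step(T, p):
    """One application of equation (29.39): p_new(x') = sum_x T(x'; x) p(x)."""
    n = len(p)
    return [sum(T[i][j] * p[j] for j in range(n)) for i in range(n)]

T = transition_matrix()
p = [0.0] * 21
p[10] = 1.0                             # initial distribution (29.40)
for t in range(400):
    p = step(T, p)

print(min(p), max(p), 1.0 / 21.0)       # min and max are both close to 1/21 by t = 400
```

Changing the initial index from 10 to 17 gives the second sequence of distributions described in the example.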


Required properties

When designing a Markov chain Monte Carlo method, we construct a chain
with the following properties:

1. The desired distribution P(x) is an invariant distribution of the chain.
A distribution π(x) is an invariant distribution of the transition
probability T(x'; x) if

\[ \pi(x') = \int d^N x \; T(x'; x) \, \pi(x). \tag{29.41} \]

An invariant distribution is an eigenvector of the transition probability
matrix that has eigenvalue 1.


2. The chain must also be ergodic, that is,

\[ p^{(t)}(x) \to \pi(x) \text{ as } t \to \infty, \text{ for any } p^{(0)}(x). \tag{29.42} \]

A couple of reasons why a chain might not be ergodic are:

(a) Its matrix might be reducible, which means that the state space
contains two or more subsets of states that can never be reached
from each other. Such a chain has many invariant distributions;
which one p^(t)(x) would tend to as t → ∞ would depend on the
initial condition p^(0)(x).

The transition probability matrix of such a chain has more than
one eigenvalue equal to 1.

(b) The chain might have a periodic set, which means that, for some
initial conditions, p^(t)(x) doesn't tend to an invariant distribution,
but instead tends to a periodic limit-cycle.

A simple Markov chain with this property is the random walk on the
N-dimensional hypercube. The chain T takes the state from one
corner to a randomly chosen adjacent corner. The unique invariant
distribution of this chain is the uniform distribution over all 2^N
states, but the chain is not ergodic; it is periodic with period two:
if we divide the states into states with odd parity and states with
even parity, we notice that every odd state is surrounded by even
states and vice versa. So if the initial condition at time t = 0 is a
state with even parity, then at time t = 1 (and at all odd times)
the state must have odd parity, and at all even times, the state
will be of even parity.

The transition probability matrix of such a chain has more than
one eigenvalue with magnitude equal to 1. The random walk on
the hypercube, for example, has eigenvalues equal to +1 and −1.

Figure 29.14. The probability distribution of the state of the Markov chain
of example 29.6.

Figure 29.15. The probability distribution of the state of the Markov chain
for initial condition x_0 = 17 (example 29.6 (p.372)).
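The periodic behaviour of the hypercube walk can be checked on a small case. The sketch below uses N = 3 (an arbitrary choice) and a dictionary representation made up for this illustration. It applies the transition to the uniform distribution and to the parity function, which correspond to the eigenvalues +1 and −1, and then follows the probability mass on odd-parity corners from an even-parity start.

```python
from itertools import product

def hypercube_step(p, N):
    """Apply the hypercube random-walk transition to p (a dict: corner -> value).

    From each corner the walk moves to one of its N adjacent corners, chosen
    uniformly. The map is linear, so p need not be a normalized distribution.
    """
    new_p = {corner: 0.0 for corner in p}
    for corner, value in p.items():
        for i in range(N):
            neighbour = corner[:i] + (1 - corner[i],) + corner[i + 1:]
            new_p[neighbour] += value / N
    return new_p

N = 3
corners = list(product((0, 1), repeat=N))
uniform = {c: 1.0 / len(corners) for c in corners}
parity = {c: (-1.0) ** sum(c) for c in corners}   # +1 on even corners, -1 on odd

after_uniform = hypercube_step(uniform, N)        # eigenvalue +1: unchanged
after_parity = hypercube_step(parity, N)          # eigenvalue -1: sign flipped

p = {c: 0.0 for c in corners}
p[(0, 0, 0)] = 1.0                                # even-parity initial condition
odd_mass = []
for t in range(1, 7):
    p = hypercube_step(p, N)
    odd_mass.append(sum(v for c, v in p.items() if sum(c) % 2 == 1))
print(odd_mass)    # alternates between 1 and 0, up to rounding
```

The alternation never dies away, which is why p^(t)(x) for this initial condition does not tend to the uniform invariant distribution.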


Methods of construction of Markov chains 


It is often convenient to construct T by mixing or concatenating simple base
transitions B all of which satisfy

\[ P(x') = \int d^N x \; B(x'; x) \, P(x), \tag{29.43} \]

for the desired density P(x), i.e., they all have the desired density as an
invariant distribution. These base transitions need not individually be ergodic.

T is a mixture of several base transitions B_b(x', x) if we make the transition
by picking one of the base transitions at random, and allowing it to determine
the transition, i.e.,

\[ T(x', x) = \sum_b p_b \, B_b(x', x), \tag{29.44} \]

where {p_b} is a probability distribution over the base transitions.

T is a concatenation of two base transitions B_1(x', x) and B_2(x', x) if we
first make a transition to an intermediate state x'' using B_1, and then make a
transition from state x'' to x' using B_2.

\[ T(x', x) = \int d^N x'' \; B_2(x', x'') \, B_1(x'', x). \tag{29.45} \]

Detailed balance


Many useful transition probabilities satisfy the detailed balance property: 
T (Xa; Xb)P (xb) = T (Xb; Xa)P(Xa), for all x, and xa. (29.46) 


This equation says that if we pick (by magic) a state from the target density 
P and make a transition under T to another state, it is just as likely that we 
will pick x, and go from x, to Xa as it is that we will pick Xa and go from Xa 
to xy. Markov chains that satisfy detailed balance are also called reversible 
Markov chains. The reason why the detailed-balance property is of interest 
is that detailed balance implies invariance of the distribution P(x) under the 
Markov chain T, which is a necessary condition for the key property that we 
want from our MCMC simulation — that the probability distribution of the 
chain should converge to P(x). 
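
The link between detailed balance and invariance can be checked numerically on a small discrete state space. The following is a minimal sketch, not taken from the text: the four-state target `p`, the nearest-neighbour proposal, and the helper names `metropolis_matrix`, `detailed_balance_gap` and `invariance_gap` are all illustrative assumptions. The matrix is stored so that `T[i][j]` is the probability of moving from state j to state i, matching the T(x'; x) convention used above.

```python
def metropolis_matrix(p):
    """Return T with T[i][j] = probability of moving j -> i on states 0..n-1.

    Proposal (an assumption of this sketch): step left or right with
    probability 1/2 each; proposals that leave the range are rejected.
    Acceptance probability: min(1, p[i] / p[j]).
    """
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in (j - 1, j + 1):
            if 0 <= i < n:
                T[i][j] = 0.5 * min(1.0, p[i] / p[j])
        # Remaining probability mass stays at j (rejections).
        T[j][j] = 1.0 - sum(T[i][j] for i in range(n) if i != j)
    return T


def detailed_balance_gap(T, p):
    """Largest |T(a; b) p(b) - T(b; a) p(a)| over all pairs (a, b)."""
    n = len(p)
    return max(abs(T[a][b] * p[b] - T[b][a] * p[a])
               for a in range(n) for b in range(n))


def invariance_gap(T, p):
    """Largest |sum_x T(x'; x) p(x) - p(x')| over all x'."""
    n = len(p)
    return max(abs(sum(T[i][j] * p[j] for j in range(n)) - p[i])
               for i in range(n))


p = [0.1, 0.2, 0.3, 0.4]          # hypothetical target distribution
T = metropolis_matrix(p)
print(detailed_balance_gap(T, p), invariance_gap(T, p))
```

Both gaps should be zero up to rounding error. This is a numerical check on one example, not a proof; the general argument is the subject of the exercise that follows.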


Exercise 29.7.1] Prove that detailed balance implies invariance of the distri- 
bution P(x) under the Markov chain T. 


Proving that detailed balance holds is often a key step when proving that a 
Markov chain Monte Carlo simulation will converge to the desired distribu- 
tion. The Metropolis method satisfies detailed balance, for example. Detailed 
balance is not an essential condition, however, and we will see later that ir- 
reversible Markov chains can be useful in practice, because they may have 
different random walk properties. 


Exercise 29.8. Show that, if we concatenate two base transitions B_1 and B_2
that satisfy detailed balance, it is not necessarily the case that the T
thus defined (29.45) satisfies detailed balance.


Exercise 29.9.!7] Does Gibbs sampling, with several variables all updated in a 
deterministic sequence, satisfy detailed balance? 


29.7 Slice sampling 


Slice sampling (Neal, 1997a; Neal, 2003) is a Markov chain Monte Carlo 
method that has similarities to rejection sampling, Gibbs sampling and the 
Metropolis method. It can be applied wherever the Metropolis method can 
be applied, that is, to any system for which the target density P*(x) can be 
evaluated at any point x; it has the advantage over simple Metropolis methods 
that it is more robust to the choice of parameters like step sizes. The sim- 
plest version of slice sampling is similar to Gibbs sampling in that it consists of 
one-dimensional transitions in the state space; however there is no requirement 
that the one-dimensional conditional distributions be easy to sample from, nor 
that they have any convexity properties such as are required for adaptive re- 
jection sampling. And slice sampling is similar to rejection sampling in that 
it is a method that asymptotically draws samples from the volume under the 
curve described by P*(x); but there is no requirement for an upper-bounding 
function. 

I will describe slice sampling by giving a sketch of a one-dimensional sam- 
pling algorithm, then giving a pictorial description that includes the details 
that make the method valid. 


The skeleton of slice sampling


Let us assume that we want to draw samples from P(x) ∝ P*(x) where x
is a real number. A one-dimensional slice sampling algorithm is a method
for making transitions from a two-dimensional point (x, u) lying under the
curve P*(x) to another point (x', u') lying under the same curve, such that
the probability distribution of (x, u) tends to a uniform distribution over the
area under the curve P*(x), whatever initial point we start from, like the
uniform distribution under the curve P*(x) produced by rejection sampling
(section 29.3).

A single transition (x, u) → (x', u') of a one-dimensional slice sampling
algorithm has the following steps, of which steps 3 and 8 will require further
elaboration.


1: evaluate P*(x)
2: draw a vertical coordinate u' ~ Uniform(0, P*(x))
3: create a horizontal interval (x_l, x_r) enclosing x
4: loop {
5:    draw x' ~ Uniform(x_l, x_r)
6:    evaluate P*(x')
7:    if P*(x') > u' break out of loop 4-9
8:    else modify the interval (x_l, x_r)
9: }








There are several methods for creating the interval (x_l, x_r) in step 3, and
several methods for modifying it at step 8. The important point is that the
overall method must satisfy detailed balance, so that the uniform distribution
for (x, u) under the curve P*(x) is invariant.


The ‘stepping out’ method for step 3 


In the ‘stepping out’ method for creating an interval (x_l, x_r) enclosing x, we
step out in steps of length w until we find endpoints x_l and x_r at which P* is
smaller than u'. The algorithm is shown in figure 29.16.


3a: draw r ~ Uniform(0, 1)
3b: x_l := x − r w
3c: x_r := x + (1 − r) w
3d: while (P*(x_l) > u') { x_l := x_l − w }
3e: while (P*(x_r) > u') { x_r := x_r + w }





The ‘shrinking’ method for step 8 


Whenever a point x' is drawn such that (x', u') lies above the curve P*(x),
we shrink the interval so that one of the end points is x', and such that the
original point x is still enclosed in the interval.


8a: if (x' > x) { x_r := x' }
8b: else { x_l := x' }
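
The skeleton, the stepping-out method and the shrinking method can be put together in a few lines. The following is a minimal sketch in Python, assuming a Gaussian example target; the names `slice_sample_step` and `p_star`, the seed, and the choice w = 1 are illustrative and not part of the text. Comments refer to the step numbers used above.

```python
import math
import random


def slice_sample_step(p_star, x, w, rng):
    """One transition x -> x' of one-dimensional slice sampling.

    p_star: unnormalized density P*(x); w: initial interval width.
    """
    u = rng.uniform(0.0, p_star(x))      # steps 1-2: height under the curve
    r = rng.random()                     # step 3a
    xl = x - r * w                       # step 3b
    xr = x + (1.0 - r) * w               # step 3c
    while p_star(xl) > u:                # step 3d: step out to the left
        xl -= w
    while p_star(xr) > u:                # step 3e: step out to the right
        xr += w
    while True:                          # steps 4-9
        x_new = rng.uniform(xl, xr)      # step 5
        if p_star(x_new) > u:            # steps 6-7: point lies under the curve
            return x_new
        if x_new > x:                    # step 8a: shrink from the right
            xr = x_new
        else:                            # step 8b: shrink from the left
            xl = x_new


def p_star(x):
    """Example target (an assumption): unnormalized standard Gaussian."""
    return math.exp(-0.5 * x * x)


rng = random.Random(0)
x, samples = 0.0, []
for _ in range(20000):
    x = slice_sample_step(p_star, x, w=1.0, rng=rng)
    samples.append(x)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

For this target the sample mean and variance should come out close to 0 and 1. Rerunning with a much smaller or much larger w changes the number of evaluations of P* per transition, but not the distribution being sampled.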








Properties of slice sampling 


Like a standard Metropolis method, slice sampling gets around by a random
walk, but whereas in the Metropolis method, the choice of the step size is
critical to the rate of progress, in slice sampling the step size is self-tuning. If
the initial interval size w is too small by a factor f compared with the width of
the probable region then the stepping-out procedure expands the interval size.
The cost of this stepping-out is only linear in f, whereas in the Metropolis
method the computer-time scales as the square of f if the step size is too
small.

Figure 29.16. Slice sampling. Each panel is labelled by the steps of
the algorithm that are executed in it. At step 1, P*(x) is evaluated
at the current point x. At step 2, a vertical coordinate is selected
giving the point (x, u') shown by the box. At steps 3a-c, an
interval of size w containing (x, u') is created at random. At
step 3d, P* is evaluated at the left end of the interval and is found to
be larger than u', so a step to the left of size w is made. At step 3e,
P* is evaluated at the right end of the interval and is found to be
smaller than u', so no stepping out to the right is needed. When
step 3d is repeated, P* is found to be smaller than u', so the
stepping out halts. At step 5 a point is drawn from the interval,
shown by a circle. Step 6 establishes that this point is above P* and
step 8 shrinks the interval to the rejected point in such a way that
the original point x is still in the interval. When step 5 is repeated,
the new coordinate x' (which is to the right-hand side of the
interval) gives a value of P* greater than u', so this point x' is
the outcome at step 7.

















If the chosen value of w is too large by a factor F then the algorithm
spends a time proportional to the logarithm of F shrinking the interval down
to the right size, since the interval typically shrinks by a factor in the ballpark
of 0.6 each time a point is rejected. In contrast, the Metropolis algorithm
responds to a too-large step size by rejecting almost all proposals, so the rate
of progress is exponentially bad in F. There are no rejections in slice sampling.
The probability of staying in exactly the same place is very small.

Figure 29.17. P*(x).


> Exercise 29.10. |?! Investigate the properties of slice sampling applied to the 
density shown in figure 29.17. x is a real variable between 0.0 and 11.0. 
How long does it take typically for slice sampling to get from an x in
the peak region x ∈ (0,1) to an x in the tail region x ∈ (1,11), and vice
versa? Confirm that the probabilities of these transitions do yield an 
asymptotic probability density that is correct. 


How slice sampling is used in real problems 


An N-dimensional density P(\mathbf{x}) ∝ P*(\mathbf{x}) may be sampled with the help of the
one-dimensional slice sampling method presented above by picking a sequence
of directions \mathbf{y}^{(1)}, \mathbf{y}^{(2)}, ... and defining \mathbf{x} = \mathbf{x}^{(t)} + x \mathbf{y}^{(t)}. The function P*(x)
above is replaced by P*(x) = P*(\mathbf{x}^{(t)} + x \mathbf{y}^{(t)}). The directions may be chosen
in various ways; for example, as in Gibbs sampling, the directions could be the
coordinate axes; alternatively, the directions \mathbf{y}^{(t)} may be selected at random
in any manner such that the overall procedure satisfies detailed balance.
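
As an illustration of the coordinate-axis choice, here is a minimal sketch that slice-samples a two-dimensional density one axis at a time. The correlated Gaussian target, the correlation `RHO`, the width w = 2 and the function names are assumptions made for this example only. The one-dimensional routine samples the scalar displacement along the current direction, starting from zero.

```python
import math
import random

RHO = 0.5  # hypothetical correlation of the example target


def p_star(x):
    """Unnormalized 2-D Gaussian with unit variances and correlation RHO."""
    q = (x[0] ** 2 - 2.0 * RHO * x[0] * x[1] + x[1] ** 2) / (1.0 - RHO ** 2)
    return math.exp(-0.5 * q)


def slice_step_1d(f, w, rng):
    """Slice-sample a scalar t from f(t), starting at t = 0."""
    u = rng.uniform(0.0, f(0.0))
    r = rng.random()
    tl, tr = -r * w, (1.0 - r) * w
    while f(tl) > u:                 # stepping out
        tl -= w
    while f(tr) > u:
        tr += w
    while True:                      # shrinking
        t = rng.uniform(tl, tr)
        if f(t) > u:
            return t
        if t > 0.0:
            tr = t
        else:
            tl = t


def slice_sweep(p_star, x, w, rng):
    """Update each coordinate in turn (directions = coordinate axes)."""
    x = list(x)
    for i in range(len(x)):
        def along_axis(t, i=i):
            y = list(x)
            y[i] += t                # point x + t * (unit vector along axis i)
            return p_star(y)
        x[i] += slice_step_1d(along_axis, w, rng)
    return x


rng = random.Random(0)
x, n = [0.0, 0.0], 20000
sum0 = sum1 = sum01 = 0.0
for _ in range(n):
    x = slice_sweep(p_star, x, 2.0, rng)
    sum0 += x[0]
    sum1 += x[1]
    sum01 += x[0] * x[1]
mean0, mean1, cross = sum0 / n, sum1 / n, sum01 / n
print(mean0, mean1, cross)
```

The two means should be near zero and the average of x_1 x_2 near RHO. Sweeping the axes in a fixed order, as here, follows the Gibbs-sampling pattern mentioned in the text.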


Computer-friendly slice sampling 


The real variables of a probabilistic model will always be represented in a 
computer using a finite number of bits. In the following implementation of 
slice sampling due to Skilling, the stepping-out, randomization, and shrinking 
operations, described above in terms of floating-point operations, are replaced 
by binary and integer operations. 

We assume that the variable x that is being slice-sampled is represented by
a b-bit integer X taking on one of B = 2^b values, 0, 1, 2, ..., B−1, many or all
of which correspond to valid values of x. Using an integer grid eliminates any
errors in detailed balance that might ensue from variable-precision rounding of
floating-point numbers. The mapping from X to x need not be linear; if it is
nonlinear, we assume that the function P*(x) is replaced by an appropriately
transformed function, for example P**(X) ∝ P*(x) |dx/dX|.

We assume the following operators on b-bit integers are available: 





X + N               arithmetic sum, modulo B, of X and N.
X − N               difference, modulo B, of X and N.
X ⊕ N               bitwise exclusive-or of X and N.
N := randbits(l)    sets N to a random l-bit integer.


A slice-sampling procedure for integers is then as follows:

Given: a current point X and a height Y = P*(X) × Uniform(0,1) ≤ P*(X)

1: U := randbits(b)              Define a random translation U of the binary
                                 coordinate system.
2: set l to a value l ≤ b        Set initial l-bit sampling range.
3: do {
4:    N := randbits(l)           Define a random move within the current
                                 interval of width 2^l.
5:    X' := ((X − U) ⊕ N) + U    Randomize the lowest l bits of X (in the
                                 translated coordinate system).
6:    l := l − 1                 If X' is not acceptable, decrease l and try
                                 again with a smaller perturbation of X;
                                 termination at or before l = 0 is assured.
7: } until (X' = X) or (P*(X') ≥ Y)





The translation U is introduced to avoid permanent sharp edges, where 
for example the adjacent binary integers 0111111111 and 1000000000 would 
otherwise be permanently in different sectors, making it difficult for X to move 
from one to the other. 
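
The integer procedure maps directly onto machine integers. The following is a minimal sketch, assuming b = 6 bits and a discretized Gaussian as the example target; the function name `integer_slice_step`, the target, the seed and the run length are illustrative choices, and `random.Random.getrandbits` plays the role of randbits.

```python
import math
import random


def integer_slice_step(p_star, X, b, rng, l0=None):
    """One transition of the integer slice sampler on 0 .. 2**b - 1."""
    B = 1 << b
    Y = p_star(X) * rng.random()                  # slice height, Y <= P*(X)
    U = rng.getrandbits(b)                        # step 1: random translation
    l = b if l0 is None else l0                   # step 2: initial l-bit range
    while True:                                   # steps 3-7
        N = rng.getrandbits(l) if l > 0 else 0    # step 4
        X_new = ((((X - U) % B) ^ N) + U) % B     # step 5, arithmetic modulo B
        l -= 1                                    # step 6
        if X_new == X or p_star(X_new) >= Y:      # step 7
            return X_new


def p_star(X):
    """Example target (an assumption): discretized Gaussian centred on 32."""
    return math.exp(-0.5 * ((X - 32) / 6.0) ** 2)


b = 6
rng = random.Random(1)
X, total, n_steps, in_range = 32, 0, 50000, True
for _ in range(n_steps):
    X = integer_slice_step(p_star, X, b, rng)
    in_range = in_range and 0 <= X < (1 << b)
    total += X
empirical_mean = total / n_steps
weights = [p_star(k) for k in range(1 << b)]
exact_mean = sum(k * wk for k, wk in enumerate(weights)) / sum(weights)
print(empirical_mean, exact_mean)
```

When l reaches zero the perturbation N is zero, so X' = X and the loop ends, which is the termination guarantee noted in the listing. The empirical mean should agree with the exact mean of the discretized target to within sampling error.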





The sequence of intervals from which the new candidate points are drawn
is illustrated in figure 29.18. First, a point is drawn from the entire interval,
shown by the top horizontal line. At each subsequent draw, the interval is
halved in such a way as to contain the previous point X.

Figure 29.18. The sequence of intervals from which the new
candidate points are drawn.

If preliminary stepping-out from the initial range is required, step 2 above
can be replaced by the following similar procedure:

2a: set l to a value l < b       l sets the initial width
2b: do {
2c:    N := randbits(l)
2d:    X' := ((X − U) ⊕ N) + U
2e:    l := l + 1
2f: } until (l = b) or (P*(X') < Y)








These shrinking and stepping out methods shrink and expand by a factor
of two per evaluation. A variant is to shrink or expand by more than one bit
each time, setting l := l ± Δl with Δl > 1. Taking Δl at each step from any
pre-assigned distribution (which may include Δl = 0) allows extra flexibility.





Exercise 29.11. In the shrinking phase, after an unacceptable X' has been
produced, the choice of Δl is allowed to depend on the difference between
the slice’s height Y and the value of P*(X'), without spoiling the
algorithm’s validity. (Prove this.) It might be a good idea to choose a larger
value of Δl when Y − P*(X') is large. Investigate this idea theoretically
or empirically.


A feature of using the integer representation is that, with a suitably ex-
tended number of bits, the single integer X can represent two or more real
parameters, for example by mapping X to (x_1, x_2, x_3) through a space-filling
curve such as a Peano curve. Thus multi-dimensional slice sampling can be
performed using the same software as for one dimension.


> 29.8 Practicalities


Can we predict how long a Markov chain Monte Carlo simulation 
will take to equilibrate? By considering the random walks involved in a 
Markov chain Monte Carlo simulation we can obtain simple lower bounds on 
the time required for convergence. But predicting this time more precisely is a 
difficult problem, and most of the theoretical results giving upper bounds on 
the convergence time are of little practical use. The exact sampling methods 
of Chapter 32 offer a solution to this problem for certain Markov chains. 


Can we diagnose or detect convergence in a running simulation? 
This is also a difficult problem. There are a few practical tools available, but 
none of them is perfect (Cowles and Carlin, 1996). 


Can we speed up the convergence time and time between indepen- 
dent samples of a Markov chain Monte Carlo method? Here, there is 
good news, as described in the next chapter, which describes the Hamiltonian 
Monte Carlo method, overrelaxation, and simulated annealing. 


> 29.9 Further practical issues 


Can the normalizing constant be evaluated? 


If the target density P(x) is given in the form of an unnormalized density 
P*(x) with P(x) = (1/Z) P*(x), the value of Z may well be of interest. Monte
Carlo methods do not readily yield an estimate of this quantity, and it is an 
area of active research to find ways of evaluating it. Techniques for evaluating 
Z include: 


1. Importance sampling (reviewed by Neal (1993b)) and annealed impor- 
tance sampling (Neal, 1998). 


2. ‘Thermodynamic integration’ during simulated annealing, the ‘accep- 
tance ratio’ method, and ‘umbrella sampling’ (reviewed by Neal (1993b)). 


3. ‘Reversible jump Markov chain Monte Carlo’ (Green, 1995). 


One way of dealing with Z, however, may be to find a solution to one’s 
task that does not require that Z be evaluated. In Bayesian data modelling 
one might be able to avoid the need to evaluate Z — which would be important 
for model comparison — by not having more than one model. Instead of using 
several models (differing in complexity, for example) and evaluating their rel- 
ative posterior probabilities, one can make a single hierarchical model having, 
for example, various continuous hyperparameters which play a role similar to 
that played by the distinct models (Neal, 1996). In noting the possibility of 
not computing Z, I am not endorsing this approach. The normalizing constant 
Z is often the single most important number in the problem, and I think every 
effort should be devoted to calculating it. 


The Metropolis method for big models 


Our original description of the Metropolis method involved a joint updating
of all the variables using a proposal density Q(x'; x). For big problems it
may be more efficient to use several proposal distributions Q^(b)(x'; x), each of
which updates only some of the components of x. Each proposal is individually
accepted or rejected, and the proposal distributions are repeatedly run through
in sequence.


> Exercise 29.12. [p.385] Explain why the rate of movement through the state
space will be greater when B proposals Q^(1), ..., Q^(B) are considered
individually in sequence, compared with the case of a single proposal
Q* defined by the concatenation of Q^(1), ..., Q^(B). Assume that each
proposal distribution Q^(b)(x'; x) has an acceptance rate f < 1/2.


In the Metropolis method, the proposal density Q(x’;x) typically has a 
number of parameters that control, for example, its ‘width’. These parameters 
are usually set by trial and error with the rule of thumb being to aim for a 
rejection frequency of about 0.5. It is not valid to have the width parameters 
be dynamically updated during the simulation in a way that depends on the 
history of the simulation. Such a modification of the proposal density would 
violate the detailed-balance condition that guarantees that the Markov chain 
has the correct invariant distribution. 


Gibbs sampling in big models 


Our description of Gibbs sampling involved sampling one parameter at a time,
as described in equations (29.35-29.37). For big problems it may be more
efficient to sample groups of variables jointly, that is to use several proposal
distributions:

x_1^{(t+1)}, ..., x_a^{(t+1)} \sim P(x_1, ..., x_a | x_{a+1}^{(t)}, ..., x_K^{(t)})    (29.47)
x_{a+1}^{(t+1)}, ..., x_b^{(t+1)} \sim P(x_{a+1}, ..., x_b | x_1^{(t+1)}, ..., x_a^{(t+1)}, x_{b+1}^{(t)}, ..., x_K^{(t)}), etc.


How many samples are needed? 


At the start of this chapter, we observed that the variance of an estimator Φ̂
depends only on the number of independent samples R and the value of


\sigma^2 = \int d^N x \, P(x) (\phi(x) - \Phi)^2.    (29.48)


We have now discussed a variety of methods for generating samples from P(x). 
How many independent samples R should we aim for? 

In many problems, we really only need about twelve independent samples
from P(x). Imagine that x is an unknown vector such as the amount of
corrosion present in each of 10000 underground pipelines around Cambridge,
and φ(x) is the total cost of repairing those pipelines. The distribution P(x)
describes the probability of a state x given the tests that have been carried out
on some pipelines and the assumptions about the physics of corrosion. The
quantity Φ is the expected cost of the repairs. The quantity σ² is the variance
of the cost: σ measures by how much we should expect the actual cost to
differ from the expectation Φ.

Now, how accurately would a manager like to know Φ? I would suggest
there is little point in knowing Φ to a precision finer than about σ/3. After
all, the true cost is likely to differ by ±σ from Φ. If we obtain R = 12
independent samples from P(x), we can estimate Φ to a precision of σ/√12,
which is smaller than σ/3. So twelve samples suffice.
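
The arithmetic behind this claim is easy to check by simulation. In the minimal sketch below, φ(x) is modelled, purely as an assumption for illustration, as a Gaussian quantity with standard deviation σ = 1; the number of repeated experiments `TRIALS` and the seed are likewise hypothetical. The spread of the twelve-sample average is compared with σ/√12 and σ/3.

```python
import math
import random

R = 12          # number of independent samples per estimate
sigma = 1.0     # standard deviation of phi(x) under P(x), assumed known here
TRIALS = 20000  # hypothetical number of repeated experiments

rng = random.Random(0)
estimates = []
for _ in range(TRIALS):
    # Each estimate of Phi is the average of R independent draws of phi(x).
    draws = [rng.gauss(0.0, sigma) for _ in range(R)]
    estimates.append(sum(draws) / R)

mean_est = sum(estimates) / TRIALS
spread = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / TRIALS)
print(spread, sigma / math.sqrt(R), sigma / 3)
```

The empirical spread should sit close to σ/√12 ≈ 0.289σ, which is below σ/3 ≈ 0.333σ.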





Allocation of resources 


Assuming we have decided how many independent samples R are required,
an important question is how one should make use of one’s limited computer
resources to obtain these samples.

Figure 29.19. Three possible Markov chain Monte Carlo
strategies for obtaining twelve samples in a fixed amount of
computer time. Time is represented by horizontal lines;
samples by white circles. (1) A single run consisting of one long
‘burn in’ period followed by a sampling period. (2) Four
medium-length runs with different initial conditions and a
medium-length burn in period. (3) Twelve short runs.

A typical Markov chain Monte Carlo experiment involves an initial pe-
riod in which control parameters of the simulation such as step sizes may be
adjusted. This is followed by a ‘burn in’ period during which we hope the
simulation ‘converges’ to the desired distribution. Finally, as the simulation
continues, we record the state vector occasionally so as to create a list of states
{x^(r)} that we hope are roughly independent samples from P(x).

There are several possible strategies (figure 29.19): 


1. Make one long run, obtaining all R samples from it. 


2. Make a few medium-length runs with different initial conditions, obtain- 
ing some samples from each. 


3. Make R short runs, each starting from a different random initial condi- 
tion, with the only state that is recorded being the final state of each 
simulation. 


The first strategy has the best chance of attaining ‘convergence’. The last 
strategy may have the advantage that the correlations between the recorded 
samples are smaller. The middle path is popular with Markov chain Monte 
Carlo experts (Gilks et al., 1996) because it avoids the inefficiency of discarding 
burn-in iterations in many runs, while still allowing one to detect problems 
with lack of convergence that would not be apparent from a single run. 

Finally, I should emphasize that there is no need to make the points in 
the estimate nearly-independent. Averaging over dependent points is fine — it 
won't lead to any bias in the estimates. For example, when you use strategy 
1 or 2, you may, if you wish, include all the points between the first and last 
sample in each run. Of course, estimating the accuracy of the estimate is 
harder when the points are dependent. 


> 29.10 Summary 
• Monte Carlo methods are a powerful tool that allow one to sample from
  any probability distribution that can be expressed in the form
  P(x) = (1/Z) P*(x).

• Monte Carlo methods can answer virtually any query related to P(x) by
  putting the query in the form

  \int \phi(x) P(x) \simeq \frac{1}{R} \sum_r \phi(x^{(r)}).    (29.49)

• In high-dimensional problems the only satisfactory methods are those
  based on Markov chains, such as the Metropolis method, Gibbs sam-
  pling and slice sampling. Gibbs sampling is an attractive method be-
  cause it has no adjustable parameters but its use is restricted to cases
  where samples can be generated from the conditional distributions. Slice
  sampling is attractive because, whilst it has step-length parameters, its
  performance is not very sensitive to their values.

• Simple Metropolis algorithms and Gibbs sampling algorithms, although
  widely used, perform poorly because they explore the space by a slow
  random walk. The next chapter will discuss methods for speeding up
  Markov chain Monte Carlo simulations.

• Slice sampling does not avoid random walk behaviour, but it automat-
  ically chooses the largest appropriate step size, thus reducing the bad
  effects of the random walk compared with, say, a Metropolis method
  with a tiny step size.


> 29.11 Exercises 


“> Exercise 29.13.12 P-386] a study of importance sampling. We already estab- 


lished in section 29.2 that importance sampling is likely to be useless in 
high-dimensional problems. This exercise explores a further cautionary 
tale, showing that importance sampling can fail even in one dimension, 
even with friendly Gaussian distributions. 


Imagine that we want to know the expectation of a function φ(x) under
a distribution P(x),


\Phi = \int dx \, P(x) \phi(x),    (29.50)


and that this expectation is estimated by importance sampling with
a distribution Q(x). Alternatively, perhaps we wish to estimate the
normalizing constant Z in P(x) = P*(x)/Z using


Z = \int dx \, P^*(x) = \int dx \, Q(x) \frac{P^*(x)}{Q(x)} = \left\langle \frac{P^*(x)}{Q(x)} \right\rangle_{x \sim Q}.    (29.51)


Now, let P(x) and Q(x) be Gaussian distributions with mean zero and
standard deviations σ_p and σ_q. Each point x drawn from Q will have
an associated weight P*(x)/Q(x). What is the variance of the weights?
[Assume that P* = P, so P is actually normalized, and Z = 1, though
we can pretend that we didn’t know that.] What happens to the variance
of the weights as σ_q² → σ_p²/2?


Check your theory by simulating this importance-sampling problem on 
a computer. 


Exercise 29.14.!7] Consider the Metropolis algorithm for the one-dimensional 


toy problem of section 29.4, sampling from {0,1,...,20}. Whenever 
the current state is one of the end states, the proposal density given in 
equation (29.34) will propose with probability 50% a state that will be 
rejected. 


To reduce this ‘waste’, Fred modifies the software responsible for gen- 
erating samples from Q so that when x = 0, the proposal density is 
100% on x’ = 1, and similarly when x = 20, x’ = 19 is always proposed. 




Fred sets the software that implements the acceptance rule so that the 
software accepts all proposed moves. What probability P'(x) will Fred’s
modified software generate samples from? 


What is the correct acceptance rule for Fred’s proposal density, in order 
to obtain samples from P(x)? 


> Exercise 29.15.120] Implement Gibbs sampling for the inference of a single 
one-dimensional Gaussian, which we studied using maximum likelihood 
in section 22.1. Assign a broad Gaussian prior to μ and a broad gamma
prior (24.2) to the precision parameter β = 1/σ². Each update of μ will
involve a sample from a Gaussian distribution, and each update of σ
requires a sample from a gamma distribution.


Exercise 29.16.19] Gibbs sampling for clustering. Implement Gibbs sampling 
for the inference of a mixture of K one-dimensional Gaussians, which we
studied using maximum likelihood in section 22.2. Allow the clusters to
have different standard deviations σ_k. Assign priors to the means and
standard deviations in the same way as the previous exercise. Either fix
the prior probabilities of the classes {π_k} to be equal or put a uniform
prior over the parameters π and include them in the Gibbs sampling.


Notice the similarity of Gibbs sampling to the soft K-means clustering
algorithm (algorithm 22.2). We can alternately assign the class labels
{k_n} given the parameters {μ_k, σ_k}, then update the parameters given
the class labels. The assignment step involves sampling from the proba-
bility distributions defined by the responsibilities (22.22), and the update
step updates the means and variances using probability distributions
centred on the K-means algorithm’s values (22.23, 22.24).


Do your experiments confirm that Monte Carlo methods bypass the over-
fitting difficulties of maximum likelihood discussed in section 22.4?


A solution to this exercise and the previous one, written in Octave, is
available.


> Exercise 29.17.120] Implement Gibbs sampling for the seven scientists inference 
problem, which we encountered in exercise 22.15 (p.309), and which you 
may have solved by exact marginalization (exercise 24.3 (p.323)) [it’s 
not essential to have done the latter]. 


> Exercise 29.18.17] A Metropolis method is used to explore a distribution P(x) 
that is actually a 1000-dimensional spherical Gaussian distribution of 
standard deviation 1 in all dimensions. The proposal density Q is a
1000-dimensional spherical Gaussian distribution of standard deviation
ε. Roughly what is the step size ε if the acceptance rate is 0.5? Assuming
this value of ε,


(a) roughly how long would the method take to traverse the distribution 
and generate a sample independent of the initial condition? 


(b) By how much does ln P(x) change in a typical step? By how much
should ln P(x) vary when x is drawn from P(x)?


(c) What happens if, rather than using a Metropolis method that tries
to change all components at once, one instead uses a concatenation
of Metropolis updates changing one component at a time?


> Exercise 29.19. When discussing the time taken by the Metropolis algo-
rithm to generate independent samples we considered a distribution with
longest spatial length scale L being explored using a proposal distribu-
tion with step size ε. Another dimension that an MCMC method must
explore is the range of possible values of the log probability ln P*(x).
Assuming that the state x contains a number of independent random
variables proportional to N, when samples are drawn from P(x), the
‘asymptotic equipartition’ principle tells us that the value of −ln P(x) is
likely to be close to the entropy of x, varying either side with a standard
deviation that scales as √N. Consider a Metropolis method with a sym-
metrical proposal density, that is, one that satisfies Q(x; x') = Q(x'; x).
Assuming that accepted jumps either increase ln P*(x) by some amount
or decrease it by a small amount, e.g. ln e = 1 (is this a reasonable
assumption?), discuss how long it must take to generate roughly inde-
pendent samples from P(x). Discuss whether Gibbs sampling has similar
properties.


Exercise 29.20.19] Markov chain Monte Carlo methods do not compute parti- 
tion functions Z, yet they allow ratios of quantities like Z to be esti- 
mated. For example, consider a random-walk Metropolis algorithm in a 
state space where the energy is zero in a connected accessible region, and 
infinitely large everywhere else; and imagine that the accessible space can 
be chopped into two regions connected by one or more corridor states. 
The fraction of times spent in each region at equilibrium is proportional 
to the volume of the region. How does the Monte Carlo method manage 
to do this without measuring the volumes? 


Exercise 29.21.!9] Philosophy. 


One curious defect of these Monte Carlo methods — which are widely used 
by Bayesian statisticians — is that they are all non-Bayesian (O’Hagan, 
1987). They involve computer experiments from which estimators of 
quantities of interest are derived. These estimators depend on the pro- 
posal distributions that were used to generate the samples and on the 
random numbers that happened to come out of our random number 
generator. In contrast, an alternative Bayesian approach to the problem 
would use the results of our computer experiments to infer the proper- 
ties of the target function P(x) and generate predictive distributions for 
quantities of interest such as ®. This approach would give answers that 
would depend only on the computed values of P*(x^(r)) at the points
{x^(r)}; the answers would not depend on how those points were chosen.


Can you make a Bayesian Monte Carlo method? (See Rasmussen and 
Ghahramani (2003) for a practical attempt.) 


> 29.12 Solutions 


Solution to exercise 29.1 (p.362). We wish to show that


\hat{\Phi} = \frac{\sum_r w_r \phi(x^{(r)})}{\sum_r w_r}    (29.52)


converges to Φ, the expectation of φ under P. We consider the numerator and the
denominator separately. First, the denominator. Consider a single importance
weight


w_r = \frac{P^*(x^{(r)})}{Q^*(x^{(r)})}.    (29.53)


What is its expectation, averaged under the distribution Q = Q*/Z_Q of the
point x^(r)?


\langle w_r \rangle = \int dx \, Q(x) \frac{P^*(x)}{Q^*(x)} = \int dx \, \frac{1}{Z_Q} P^*(x) = \frac{Z_P}{Z_Q}.    (29.54)


So the expectation of the denominator is


\left\langle \sum_r w_r \right\rangle = R \frac{Z_P}{Z_Q}.    (29.55)


As long as the variance of w_r is finite, the denominator, divided by R, will
converge to Z_P/Z_Q as R increases. [In fact, the estimate converges to the
right answer even if this variance is infinite, as long as the expectation is
well-defined.] Similarly, the expectation of one term in the numerator is


\langle w_r \phi(x) \rangle = \int dx \, Q(x) \frac{P^*(x)}{Q^*(x)} \phi(x) = \int dx \, \frac{1}{Z_Q} P^*(x) \phi(x) = \frac{Z_P}{Z_Q} \Phi,    (29.56)


where Φ is the expectation of φ under P. So the numerator, divided by R,
converges to (Z_P/Z_Q) Φ with increasing R. Thus Φ̂ converges to Φ.


The numerator and the denominator are unbiased estimators of R (Z_P/Z_Q) Φ
and R Z_P/Z_Q respectively, but their ratio Φ̂ is not necessarily an unbiased
estimator for finite R.
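
The convergence of the numerator and denominator can be watched in a small experiment. The following is a minimal sketch under assumptions chosen for illustration only: P* is an unnormalized Gaussian with mean 1 and standard deviation 1, Q* is an unnormalized zero-mean Gaussian with standard deviation 2, and φ(x) = x, so that Φ = 1 and Z_P/Z_Q = 1/2. The function name `importance_estimate` and the sample size are hypothetical.

```python
import math
import random


def importance_estimate(phi, p_star, q_star, draw_q, R, rng):
    """Return (sum_r w_r phi(x_r) / sum_r w_r, sum_r w_r / R).

    The weights are w_r = P*(x_r) / Q*(x_r), with x_r drawn from Q.
    """
    num = den = 0.0
    for _ in range(R):
        x = draw_q(rng)
        w = p_star(x) / q_star(x)
        num += w * phi(x)
        den += w
    return num / den, den / R


def p_star(x):
    """Unnormalized target: Gaussian, mean 1, standard deviation 1."""
    return math.exp(-0.5 * (x - 1.0) ** 2)


def q_star(x):
    """Unnormalized sampler density: Gaussian, mean 0, standard deviation 2."""
    return math.exp(-0.5 * x * x / 4.0)


def draw_q(rng):
    """Draw one point from Q."""
    return rng.gauss(0.0, 2.0)


rng = random.Random(0)
phi_hat, z_ratio = importance_estimate(lambda x: x, p_star, q_star,
                                       draw_q, 100000, rng)
print(phi_hat, z_ratio)
```

With these choices the ratio estimate should approach 1 and the average weight should approach Z_P/Z_Q = 0.5. The sampler is deliberately broader than the target so that the weights have finite variance.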


Solution to exercise 29.2 (p.363). When the true density P is multimodal, it is 
unwise to use importance sampling with a sampler density fitted to one mode, 
because on the rare occasions that a point is produced that lands in one of 
the other modes, the weight associated with that point will be enormous. The 
estimates will have enormous variance, but this enormous variance may not 
be evident to the user if no points in the other modes have been seen. 


Solution to exercise 29.5 (p.371). The posterior distribution for the syndrome 
decoding problem is a pathological distribution from the point of view of Gibbs 
sampling. The factor 1[Hn = z] is 1 only on a small fraction of the space of
possible vectors n, namely the 2^K points that correspond to the valid
codewords. No two codewords are adjacent, so similarly, any single bit flip from
a viable state n will take us to a state with zero probability and so the state 
will never move in Gibbs sampling. 

A general code has exactly the same problem. The points corresponding 
to valid codewords are relatively few in number and they are not adjacent (at 
least for any useful code). So Gibbs sampling is no use for syndrome decoding 
for two reasons. First, finding any reasonably good hypothesis is difficult, and 
as long as the state is not near a valid codeword, Gibbs sampling cannot help 
since none of the conditional distributions is defined; and second, once we are 
in a valid hypothesis, Gibbs sampling will never take us out of it. 

One could attempt to perform Gibbs sampling using the bits of the original 
message s as the variables. This approach would not get locked up in the way 
just described, but, for a good code, any single bit flip would substantially 
alter the reconstructed codeword, so if one had found a state with reasonably 
large likelihood, Gibbs sampling would take an impractically large time to 
escape from it. 


Solution to exercise 29.12 (p.380). Each Metropolis proposal will take the 
energy of the state up or down by some amount. The total change in energy 
when $B$ proposals are concatenated will be the end-point of a random walk 
with $B$ steps in it. This walk might have mean zero, or it might have a 
tendency to drift upwards (if most moves increase the energy and only a few 
decrease it). In general the latter will hold, if the acceptance rate $f$ is small: 
the mean change in energy from any one move will be some $\Delta E > 0$ and so 
the acceptance probability for the concatenation of $B$ moves will be of order 
$1/(1 + \exp(B \Delta E))$, which scales roughly as $f^B$. The mean-square-distance 
moved will be of order $f^B B \epsilon^2$, where $\epsilon$ is the typical step size. In contrast, 
the mean-square-distance moved when the moves are considered individually 
will be of order $f B \epsilon^2$. 


Figure 29.20. Importance sampling in one dimension. For $R = 1000$, $10^4$, and $10^5$, the 
normalizing constant of a Gaussian distribution (known in fact to be 1) was estimated using 
importance sampling with a sampler density of standard deviation $\sigma_q$ (horizontal axis). 
The same random number seed was used for all runs. The three plots show (a) the estimated 
normalizing constant; (b) the empirical standard deviation of the $R$ weights; (c) 30 of the 
weights. 


Solution to exercise 29.13 (p.382). The weights are $w = P(x)/Q(x)$ and $x$ is 
drawn from $Q$. The mean weight is 

$$\int \mathrm{d}x \, Q(x) \, [P(x)/Q(x)] = \int \mathrm{d}x \, P(x) = 1, \tag{29.57}$$

assuming the integral converges. The variance is 

$$\operatorname{var}(w) = \int \mathrm{d}x \, Q(x) \left[ \frac{P(x)}{Q(x)} - 1 \right]^2 \tag{29.58}$$

$$= \int \mathrm{d}x \left[ \frac{P(x)^2}{Q(x)} - 2 P(x) + Q(x) \right] \tag{29.59}$$

$$= \left[ \int \mathrm{d}x \, \frac{Z_Q}{Z_P^2} \exp\left( -\frac{x^2}{2} \left( \frac{2}{\sigma_p^2} - \frac{1}{\sigma_q^2} \right) \right) \right] - 1, \tag{29.60}$$

where $Z_Q/Z_P^2 = \sigma_q/(\sqrt{2\pi} \sigma_p^2)$. The integral in (29.60) is finite only if the 
coefficient of $x^2$ in the exponent is positive, i.e., if 

$$\sigma_q^2 > \frac{1}{2} \sigma_p^2. \tag{29.61}$$

If this condition is satisfied, the variance is 

$$\operatorname{var}(w) = \frac{\sigma_q}{\sqrt{2\pi} \sigma_p^2} \sqrt{2\pi} \left( \frac{2}{\sigma_p^2} - \frac{1}{\sigma_q^2} \right)^{-1/2} - 1 = \frac{\sigma_q^2}{\sigma_p \left( 2\sigma_q^2 - \sigma_p^2 \right)^{1/2}} - 1. \tag{29.62}$$

As $\sigma_q$ approaches the critical value, about $0.7 \sigma_p$, the variance becomes 
infinite. Figure 29.20 illustrates these phenomena for $\sigma_p = 1$ with $\sigma_q$ varying 
from 0.1 to 1.5. The same random number seed was used for all runs, so 
the weights and estimates follow smooth curves. Notice that the empirical 
standard deviation of the $R$ weights can look quite small and well-behaved 
(say, at $\sigma_q \simeq 0.3$) when the true standard deviation is nevertheless infinite. 
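
The closed-form variance of equation (29.62) and the condition (29.61) can be evaluated directly. This is a small sketch, assuming zero-mean Gaussians for both $P$ and $Q$ as in the solution; the function name `weight_variance` is hypothetical.

```python
import math

def weight_variance(sigma_q, sigma_p=1.0):
    """Variance of w = P(x)/Q(x) for zero-mean Gaussians P and Q, with x ~ Q.

    Implements equation (29.62); returns math.inf when condition (29.61)
    fails, i.e. when sigma_q**2 <= sigma_p**2 / 2.
    """
    gap = 2.0 * sigma_q**2 - sigma_p**2
    if gap <= 0.0:
        return math.inf
    return sigma_q**2 / (sigma_p * math.sqrt(gap)) - 1.0

for sigma_q in (0.3, 0.71, 0.8, 1.0, 1.5):
    print(sigma_q, weight_variance(sigma_q))
```

With $\sigma_p = 1$ the variance is zero at $\sigma_q = 1$ (the sampler equals the target) and infinite for every $\sigma_q$ at or below $1/\sqrt{2}$, which includes the deceptively well-behaved case $\sigma_q = 0.3$.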


Efficient Monte Carlo Methods 


This chapter discusses several methods for reducing random walk behaviour 
in Metropolis methods. The aim is to reduce the time required to obtain 
effectively independent samples. For brevity, we will say ‘independent samples’ 
when we mean ‘effectively independent samples’. 


> 30.1 Hamiltonian Monte Carlo 


The Hamiltonian Monte Carlo method is a Metropolis method, applicable 
to continuous state spaces, that makes use of gradient information to reduce 
random walk behaviour. [The Hamiltonian Monte Carlo method was originally 
called hybrid Monte Carlo, for historical reasons.] 

For many systems whose probability $P(\mathbf{x})$ can be written in the form 

$$P(\mathbf{x}) = \frac{e^{-E(\mathbf{x})}}{Z}, \tag{30.1}$$


not only E(x) but also its gradient with respect to x can be readily evaluated. 
It seems wasteful to use a simple random-walk Metropolis method when this 
gradient is available — the gradient indicates which direction one should go in 
to find states that have higher probability! 


Overview of Hamiltonian Monte Carlo 


In the Hamiltonian Monte Carlo method, the state space $\mathbf{x}$ is augmented by 
momentum variables $\mathbf{p}$, and there is an alternation of two types of proposal. 
The first proposal randomizes the momentum variable, leaving the state $\mathbf{x}$ 
unchanged. The second proposal changes both $\mathbf{x}$ and $\mathbf{p}$ using simulated 
Hamiltonian dynamics as defined by the Hamiltonian 

$$H(\mathbf{x}, \mathbf{p}) = E(\mathbf{x}) + K(\mathbf{p}), \tag{30.2}$$

where $K(\mathbf{p})$ is a ‘kinetic energy’ such as $K(\mathbf{p}) = \mathbf{p}^{\mathsf{T}} \mathbf{p}/2$. These two proposals 
are used to create (asymptotically) samples from the joint density 

$$P_H(\mathbf{x}, \mathbf{p}) = \frac{1}{Z_H} \exp[-H(\mathbf{x}, \mathbf{p})] = \frac{1}{Z_H} \exp[-E(\mathbf{x})] \exp[-K(\mathbf{p})]. \tag{30.3}$$

This density is separable, so the marginal distribution of $\mathbf{x}$ is the desired 
distribution $\exp[-E(\mathbf{x})]/Z$. So, simply discarding the momentum variables, 
we obtain a sequence of samples $\{\mathbf{x}^{(t)}\}$ that asymptotically come from $P(\mathbf{x})$. 




Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http:/www.cambridge.org/0521642981 


You can buy this book for 30 pounds or $50. See http://www. inference.phy.cam.ac.uk/mackay/itila/ for links. 




    g = gradE ( x ) ;                  # set gradient using initial x
    E = findE ( x ) ;                  # set objective function too

    for l = 1:L                        # loop L times
      p = randn ( size(x) ) ;          # initial momentum is Normal(0,1)
      H = p' * p / 2 + E ;             # evaluate H(x,p)

      xnew = x ; gnew = g ;
      for tau = 1:Tau                  # make Tau 'leapfrog' steps
        p = p - epsilon * gnew / 2 ;   # make half-step in p
        xnew = xnew + epsilon * p ;    # make step in x
        gnew = gradE ( xnew ) ;        # find new gradient
        p = p - epsilon * gnew / 2 ;   # make half-step in p
      endfor

      Enew = findE ( xnew ) ;          # find new value of H
      Hnew = p' * p / 2 + Enew ;
      dH = Hnew - H ;                  # Decide whether to accept

      if ( dH < 0 )                    accept = 1 ;
      elseif ( rand() < exp(-dH) )     accept = 1 ;
      else                             accept = 0 ;
      endif

      if ( accept )
        g = gnew ; x = xnew ; E = Enew ;
      endif
    endfor











Algorithm 30.1. Octave source code for the Hamiltonian Monte Carlo method. 


[Figure 30.2: four panels; (a,b) titled ‘Hamiltonian Monte Carlo’, (c,d) titled ‘Simple Metropolis’.] 

Figure 30.2. (a,b) Hamiltonian Monte Carlo used to generate samples from a bivariate Gaussian 
with correlation $\rho = 0.998$. (c,d) For comparison, a simple random-walk Metropolis method, 
given equal computer time. 


Details of Hamiltonian Monte Carlo 


The first proposal, which can be viewed as a Gibbs sampling update, draws a 
new momentum from the Gaussian density $\exp[-K(\mathbf{p})]/Z_K$. This proposal is 
always accepted. During the second, dynamical proposal, the momentum 
variable determines where the state $\mathbf{x}$ goes, and the gradient of $E(\mathbf{x})$ determines 
how the momentum $\mathbf{p}$ changes, in accordance with the equations 

$$\dot{\mathbf{x}} = \mathbf{p} \tag{30.4}$$

$$\dot{\mathbf{p}} = -\frac{\partial E(\mathbf{x})}{\partial \mathbf{x}}. \tag{30.5}$$

Because of the persistent motion of $\mathbf{x}$ in the direction of the momentum $\mathbf{p}$ 
during each dynamical proposal, the state of the system tends to move a 
distance that goes linearly with the computer time, rather than as the square 
root. 

The second proposal is accepted in accordance with the Metropolis rule. 
If the simulation of the Hamiltonian dynamics is numerically perfect then 
the proposals are accepted every time, because the total energy $H(\mathbf{x}, \mathbf{p})$ is a 
constant of the motion and so $a$ in equation (29.31) is equal to one. If the 
simulation is imperfect, because of finite step sizes for example, then some of 
the dynamical proposals will be rejected. The rejection rule makes use of the 
change in $H(\mathbf{x}, \mathbf{p})$, which is zero if the simulation is perfect. The occasional 
rejections ensure that, asymptotically, we obtain samples $(\mathbf{x}^{(t)}, \mathbf{p}^{(t)})$ from the 
required joint density $P_H(\mathbf{x}, \mathbf{p})$. 

The source code in algorithm 30.1 describes a Hamiltonian Monte Carlo method 
that uses the ‘leapfrog’ algorithm to simulate the dynamics on the function 
findE(x), whose gradient is found by the function gradE(x). Figure 30.2 
shows this algorithm generating samples from a bivariate Gaussian whose 
energy function is $E(\mathbf{x}) = \frac{1}{2} \mathbf{x}^{\mathsf{T}} \mathbf{A} \mathbf{x}$ with 

$$\mathbf{A} = \begin{bmatrix} 250.25 & -249.75 \\ -249.75 & 250.25 \end{bmatrix}, \tag{30.6}$$

corresponding to a variance-covariance matrix of 

$$\begin{bmatrix} 1 & 0.998 \\ 0.998 & 1 \end{bmatrix}. \tag{30.7}$$
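
For readers without Octave, the following is a hedged Python sketch of the same leapfrog scheme as algorithm 30.1, applied to the energy defined by the matrix of equation (30.6). It is a port for illustration rather than the book's code: the names `find_E`, `grad_E`, and `hmc`, the starting point, the seed, and the use of fixed (not randomized) trajectory parameters are all assumptions.

```python
import math
import random

A = [[250.25, -249.75], [-249.75, 250.25]]  # matrix of equation (30.6)

def find_E(x):
    """Energy E(x) = x^T A x / 2."""
    return 0.5 * sum(x[i] * A[i][j] * x[j] for i in range(2) for j in range(2))

def grad_E(x):
    """Gradient of E, which is A x."""
    return [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

def hmc(x, n_iter, n_leapfrog=19, epsilon=0.055, rng=random):
    """Run n_iter HMC iterations; return the samples and the acceptance rate."""
    g, E = grad_E(x), find_E(x)
    samples, n_accept = [], 0
    for _ in range(n_iter):
        p = [rng.gauss(0.0, 1.0) for _ in x]            # fresh momentum
        H = 0.5 * sum(pi * pi for pi in p) + E          # H(x, p) at the start
        x_new, g_new = list(x), list(g)
        for _ in range(n_leapfrog):
            p = [pi - 0.5 * epsilon * gi for pi, gi in zip(p, g_new)]  # half-step in p
            x_new = [xi + epsilon * pi for xi, pi in zip(x_new, p)]    # full step in x
            g_new = grad_E(x_new)
            p = [pi - 0.5 * epsilon * gi for pi, gi in zip(p, g_new)]  # half-step in p
        E_new = find_E(x_new)
        dH = 0.5 * sum(pi * pi for pi in p) + E_new - H
        if dH < 0 or rng.random() < math.exp(-dH):      # Metropolis rule
            x, g, E = x_new, g_new, E_new
            n_accept += 1
        samples.append(list(x))
    return samples, n_accept / n_iter

rng = random.Random(0)
samples, rate = hmc([0.0, 0.0], 2000, rng=rng)
mean_x1 = sum(s[0] for s in samples) / len(samples)
print("acceptance rate:", rate, "mean of x1:", mean_x1)
```

The step size satisfies the usual leapfrog stability requirement for the stiff direction of this Gaussian (eigenvalue 500 of the matrix), so most trajectories should be accepted; randomizing `n_leapfrog` and `epsilon` per trajectory, as in figure 30.2b, would be a natural refinement.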


In figure 30.2a, starting from the state marked by the arrow, the solid line 
represents two successive trajectories generated by the Hamiltonian dynamics. 
The squares show the endpoints of these two trajectories. Each trajectory 
consists of Tau = 19 ‘leapfrog’ steps with epsilon = 0.055. These steps are 
indicated by the crosses on the trajectory in the magnified inset. After each 
trajectory, the momentum is randomized. Here, both trajectories are accepted; 
the errors in the Hamiltonian were only $+0.016$ and $-0.06$ respectively. 
Figure 30.2b shows how a sequence of four trajectories converges from an 
initial condition, indicated by the arrow, that is not close to the typical set 
of the target distribution. The trajectory parameters Tau and epsilon were 
randomized for each trajectory using uniform distributions with means 19 and 
0.055 respectively. The first trajectory takes us to a new state, $(-1.5, -0.5)$, 
similar in energy to the first state. The second trajectory happens to end in 
a state nearer the bottom of the energy landscape. Here, since the potential 
energy $E$ is smaller, the kinetic energy $K = \mathbf{p}^2/2$ is necessarily larger than it 
was at the start of the trajectory. When the momentum is randomized before 
the third trajectory, its kinetic energy becomes much smaller. After the fourth 
trajectory has been simulated, the state appears to have become typical of the 
target density. 


[Figure 30.3: panels (a), (b), (c), with columns titled ‘Gibbs sampling’ and ‘Overrelaxation’.] 

Figure 30.3. Overrelaxation contrasted with Gibbs sampling for a bivariate Gaussian with 
correlation $\rho = 0.998$. (a) The state sequence for 40 iterations, each iteration involving one 
update of both variables. The overrelaxation method had $\alpha = -0.98$. (This excessively large 
value is chosen to make it easy to see how the overrelaxation method reduces random walk 
behaviour.) The dotted line shows the contour $\mathbf{x}^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} \mathbf{x} = 1$. (b) Detail of (a), 
showing the two steps making up each iteration. (c) Time-course of the variable $x_1$ during 2000 
iterations of the two methods. The overrelaxation method had $\alpha = -0.89$. (After Neal (1995).) 


Figures 30.2(c) and (d) show a random-walk Metropolis method using a 
Gaussian proposal density to sample from the same Gaussian distribution, 
starting from the initial conditions of (a) and (b) respectively. In (c) the step 
size was adjusted such that the acceptance rate was 58%. The number of 
proposals was 38 so the total amount of computer time used was similar to 
that in (a). The distance moved is small because of random walk behaviour. 
In (d) the random-walk Metropolis method was used and started from the 
same initial condition as (b) and given a similar amount of computer time. 


> 30.2 Overrelaxation 


The method of overrelaxation is a method for reducing random walk behaviour 
in Gibbs sampling. Overrelaxation was originally introduced for systems in 
which all the conditional distributions are Gaussian. 


An example of a joint distribution that is not Gaussian but whose conditional 
distributions are all Gaussian is $P(x, y) = \exp(-x^2 y^2 - x^2 - y^2)/Z$. 


Overrelaxation for Gaussian conditional distributions 


In ordinary Gibbs sampling, one draws the new value $x_i^{(t+1)}$ of the current 
variable $x_i$ from its conditional distribution, ignoring the old value $x_i^{(t)}$. The 
state makes lengthy random walks in cases where the variables are strongly 
correlated, as illustrated in the left-hand panel of figure 30.3. This figure uses 
a correlated Gaussian distribution as the target density. 

In Adler’s (1981) overrelaxation method, one instead samples $x_i^{(t+1)}$ from 
a Gaussian that is biased to the opposite side of the conditional distribution. 
If the conditional distribution of $x_i$ is Normal$(\mu, \sigma^2)$ and the current value of 
$x_i$ is $x_i^{(t)}$, then Adler’s method sets $x_i$ to 

$$x_i^{(t+1)} = \mu + \alpha (x_i^{(t)} - \mu) + (1 - \alpha^2)^{1/2} \sigma \nu, \tag{30.8}$$

where $\nu \sim \text{Normal}(0, 1)$ and $\alpha$ is a parameter between $-1$ and 1, usually set to 
a negative value. (If $\alpha$ is positive, then the method is called under-relaxation.) 
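
The update of equation (30.8) is short enough to write out. Below is a minimal sketch, assuming a unit-variance bivariate Gaussian with correlation `rho` as the target, so that each conditional is Normal with mean `rho` times the other coordinate and variance `1 - rho**2`; the function names are hypothetical.

```python
import math
import random

def overrelax(x_old, mu, sigma, alpha, rng=random):
    """One Adler update, equation (30.8), for a Normal(mu, sigma^2) conditional."""
    nu = rng.gauss(0.0, 1.0)
    return mu + alpha * (x_old - mu) + math.sqrt(1.0 - alpha**2) * sigma * nu

def sweep(x, rho, alpha, rng=random):
    """Update both coordinates of a unit-variance bivariate Gaussian in fixed order."""
    sigma = math.sqrt(1.0 - rho**2)            # conditional standard deviation
    x[0] = overrelax(x[0], rho * x[1], sigma, alpha, rng)
    x[1] = overrelax(x[1], rho * x[0], sigma, alpha, rng)
    return x

rng = random.Random(0)
state = [0.0, 0.0]
for _ in range(40):
    state = sweep(state, rho=0.998, alpha=-0.98, rng=rng)
print(state)
```

Setting `alpha=0` recovers ordinary Gibbs sampling, since the new value is then drawn from the conditional with no dependence on the old value.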


Exercise 30.1. Show that this individual transition leaves invariant the 
conditional distribution $x_i \sim \text{Normal}(\mu, \sigma^2)$. 


A single iteration of Adler’s overrelaxation, like one of Gibbs sampling, updates 
each variable in turn as indicated in equation (30.8). The transition matrix 
T(x';x) defined by a complete update of all variables in some fixed order does 
not satisfy detailed balance. Each individual transition for one coordinate 
just described does satisfy detailed balance — so the overall chain gives a valid 
sampling strategy which converges to the target density P(x) — but when we 
form a chain by applying the individual transitions in a fixed sequence, the 
overall chain is not reversible. This temporal asymmetry is the key to why 
overrelaxation can be beneficial. If, say, two variables are positively correlated, 
then they will (on a short timescale) evolve in a directed manner instead of by 
random walk, as shown in figure 30.3. This may significantly reduce the time 
required to obtain independent samples. 


Exercise 30.2.!9] The transition matrix T(x’; x) defined by a complete update 
of all variables in some fixed order does not satisfy detailed balance. If 
the updates were in a random order, then T would be symmetric. Inves- 
tigate, for the toy two-dimensional Gaussian distribution, the assertion 
that the advantages of overrelaxation are lost if the overrelaxed updates 
are made in a random order. 


Ordered Overrelaxation 


The overrelaxation method has been generalized by Neal (1995) whose ordered 
overrelaxation method is applicable to any system where Gibbs sampling is 
used. In ordered overrelaxation, instead of taking one sample from the 
conditional distribution $P(x_i \mid \{x_j\}_{j \neq i})$, we create $K$ such samples 
$x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(K)}$, 
where $K$ might be set to twenty or so. Often, generating $K - 1$ extra samples 
adds a negligible computational cost to the initial computations required for 
making the first sample. The points $\{x_i^{(k)}\}$ are then sorted numerically, and 
the current value of $x_i$ is inserted into the sorted list, giving a list of $K + 1$ 
points. We give them ranks $0, 1, 2, \ldots, K$. Let $\kappa$ be the rank of the current 
value of $x_i$ in the list. We set $x_i'$ to the value that is an equal distance from 
the other end of the list, that is, the value with rank $K - \kappa$. The role played 
by Adler’s $\alpha$ parameter is here played by the parameter $K$. When $K = 1$, we 
obtain ordinary Gibbs sampling. For practical purposes Neal estimates that 
ordered overrelaxation may speed up a simulation by a factor of ten or twenty. 
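
The ranking step can be sketched in a few lines. This is an illustrative sketch only: the function name `ordered_overrelax` and the callable `draw_conditional` (which is assumed to return one sample from the conditional distribution of the variable being updated) are hypothetical, and ties between values are not treated specially.

```python
import bisect
import random

def ordered_overrelax(x_current, draw_conditional, K=20):
    """Ordered overrelaxation for one variable (after Neal, 1995).

    draw_conditional() must return one sample from P(x_i | other variables).
    """
    points = sorted(draw_conditional() for _ in range(K))
    kappa = bisect.bisect_left(points, x_current)   # rank of the current value
    points.insert(kappa, x_current)                 # sorted list of K + 1 points
    return points[K - kappa]                        # value with the mirror-image rank

rng = random.Random(0)
print(ordered_overrelax(1.3, lambda: rng.gauss(0.0, 1.0), K=20))
```

With `K=1` the returned value is always the single fresh sample, which is ordinary Gibbs sampling, as stated above.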


> 30.3 Simulated annealing 


A third technique for speeding convergence is simulated annealing. In simu- 
lated annealing, a ‘temperature’ parameter is introduced which, when large, 
allows the system to make transitions that would be improbable at temper- 
ature 1. The temperature is set to a large value and gradually reduced to 
1. This procedure is supposed to reduce the chance that the simulation gets 
stuck in an unrepresentative probability island. 

We assume that we wish to sample from a distribution of the form 

$$P(\mathbf{x}) = \frac{e^{-E(\mathbf{x})}}{Z}, \tag{30.9}$$

where $E(\mathbf{x})$ can be evaluated. In the simplest simulated annealing method, 
we instead sample from the distribution 

$$P_T(\mathbf{x}) = \frac{1}{Z(T)} e^{-E(\mathbf{x})/T} \tag{30.10}$$

and decrease $T$ gradually to 1. 

Often the energy function can be separated into two terms, 

$$E(\mathbf{x}) = E_0(\mathbf{x}) + E_1(\mathbf{x}), \tag{30.11}$$

of which the first term is ‘nice’ (for example, a separable function of $\mathbf{x}$) and the 
second is ‘nasty’. In these cases, a better simulated annealing method might 
make use of the distribution 

$$P'_T(\mathbf{x}) = \frac{1}{Z'(T)} e^{-E_0(\mathbf{x}) - E_1(\mathbf{x})/T} \tag{30.12}$$

with $T$ gradually decreasing to 1. In this way, the distribution at high 
temperatures reverts to a well-behaved distribution defined by $E_0$. 

Simulated annealing is often used as an optimization method, where the 
aim is to find an x that minimizes E(x), in which case the temperature is 
decreased to zero rather than to 1. 
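
The simplest scheme, equation (30.10) with the temperature lowered to 1, can be sketched as a random-walk Metropolis method whose acceptance rule uses $E/T$ in place of $E$. Everything specific here is an assumption made for illustration: the double-well energy, the geometric schedule from $T = 50$, the step size, and the function names.

```python
import math
import random

def accept_probability(dE, T):
    """Metropolis acceptance probability at temperature T for an energy change dE."""
    return 1.0 if dE <= 0.0 else math.exp(-dE / T)

def anneal(energy, x, temperatures, step=0.5, rng=random):
    """Random-walk Metropolis on exp(-E(x)/T), one proposal per temperature."""
    E = energy(x)
    for T in temperatures:
        x_new = x + rng.gauss(0.0, step)
        E_new = energy(x_new)
        if rng.random() < accept_probability(E_new - E, T):
            x, E = x_new, E_new
    return x

def double_well(x):
    """Hypothetical toy energy with minima near x = -2 and x = +2."""
    return (x * x - 4.0) ** 2

# Geometric schedule from T = 50 down to T = 1 (the schedule is an assumption).
schedule = [50.0 ** (1.0 - k / 999.0) for k in range(1000)]
print(anneal(double_well, -2.0, schedule, rng=random.Random(0)))
```

At high temperature an uphill move of a given size is accepted far more often than at $T = 1$, which is what lets the state cross the barrier between the two wells early in the schedule.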

As a Monte Carlo method, simulated annealing as described above doesn’t 
sample exactly from the right distribution, because there is no guarantee that 
the probability of falling into one basin of the energy is equal to the total prob- 
ability of all the states in that basin. The closely related ‘simulated tempering’ 
method (Marinari and Parisi, 1992) corrects the biases introduced by the an- 
nealing process by making the temperature itself a random variable that is 
updated in Metropolis fashion during the simulation. Neal’s (1998) ‘annealed 
importance sampling’ method removes the biases introduced by annealing by 
computing importance weights for each generated point. 


> 30.4 Skilling’s multi-state leapfrog method 


A fourth method for speeding up Monte Carlo simulations, due to John 
Skilling, has a similar spirit to overrelaxation, but works in more dimensions. 
This method is applicable to sampling from a distribution over a continuous 
state space, and the sole requirement is that the energy E(x) should be easy 
to evaluate. The gradient is not used. This leapfrog method is not intended to 
be used on its own but rather in sequence with other Monte Carlo operators. 

Instead of moving just one state vector $\mathbf{x}$ around the state space, as was 
the case for all the Monte Carlo methods discussed thus far, Skilling’s leapfrog 
method simultaneously maintains a set of $S$ state vectors $\{\mathbf{x}^{(s)}\}$, where $S$ 
might be six or twelve. The aim is that all $S$ of these vectors will represent 
independent samples from the same distribution $P(\mathbf{x})$. 

Skilling’s leapfrog makes a proposal for the new state $\mathbf{x}^{(s)\prime}$, which is 
accepted or rejected in accordance with the Metropolis method, by leapfrogging 
the current state $\mathbf{x}^{(s)}$ over another state vector $\mathbf{x}^{(t)}$: 

$$\mathbf{x}^{(s)\prime} = \mathbf{x}^{(t)} + (\mathbf{x}^{(t)} - \mathbf{x}^{(s)}) = 2\mathbf{x}^{(t)} - \mathbf{x}^{(s)}. \tag{30.13}$$

All the other state vectors are left where they are, so the acceptance probability 
depends only on the change in energy of $\mathbf{x}^{(s)}$. 

Which vector, $t$, is the partner for the leapfrog event can be chosen in 
various ways. The simplest method is to select the partner at random from 
the other vectors. It might be better to choose $t$ by selecting one of the 
nearest neighbours of $\mathbf{x}^{(s)}$, nearest by any chosen distance function, as long 
as one then uses an acceptance rule that ensures detailed balance by checking 
whether point $t$ is still among the nearest neighbours of the new point, $\mathbf{x}^{(s)\prime}$. 

Why the leapfrog is a good idea 


Imagine that the target density $P(\mathbf{x})$ has strong correlations; for example, 
the density might be a needle-like Gaussian with width $\epsilon$ and length $L\epsilon$, where 
$L \gg 1$. As we have emphasized, motion around such a density by standard 
methods proceeds by a slow random walk. 

Imagine now that our set of $S$ points is lurking initially in a location that 
is probable under the density, but in an inappropriately small ball of size $\epsilon$. 
Now, under Skilling’s leapfrog method, a typical first move will take the point 
a little outside the current ball, perhaps doubling its distance from the centre 
of the ball. After all the points have had a chance to move, the ball will have 
increased in size; if all the moves are accepted, the ball will be bigger by a 
factor of two or so in all dimensions. The rejection of some moves will mean 
that the ball containing the points will probably have elongated in the needle’s 
long direction by a factor of, say, two. After another cycle through the points, 
the ball will have grown in the long direction by another factor of two. So the 
typical distance travelled in the long dimension grows exponentially with the 
number of iterations. 

Now, maybe a factor of two growth per iteration is on the optimistic side; 
but even if the ball only grows by a factor of, let’s say, 1.1 per iteration, the 
growth is nevertheless exponential. It will only take a number of iterations 
proportional to log L/log(1.1) for the long dimension to be explored. 
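
As a rough worked example of this estimate (the value $L = 1000$ is an assumption chosen purely for illustration, and the constant of proportionality is ignored), the iteration count at a growth factor of 1.1 can be set beside the order-$L^2$ steps that a random walk with step size $\epsilon$ would need to cover the same length:

```latex
\frac{\log L}{\log 1.1} = \frac{\log 1000}{\log 1.1} \approx \frac{6.91}{0.0953} \approx 72,
\qquad \text{whereas} \qquad L^2 = 10^6 .
```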


> Exercise 30.3.1% P-398] Discuss how the effectiveness of Skilling’s method scales 
with dimensionality, using a correlated N-dimensional Gaussian distri- 
bution as an example. Find an expression for the rejection probability, 
assuming the Markov chain is at equilibrium. Also discuss how it scales 
with the strength of correlation among the Gaussian variables. [Hint: 
Skilling’s method is invariant under affine transformations, so the 
rejection probability at equilibrium can be found by looking at the case of a 
separable Gaussian.] 


This method has some similarity to the ‘adaptive direction sampling’ method 
of Gilks et al. (1994) but the leapfrog method is simpler and can be applied 
to a greater variety of distributions. 


> 30.5 Monte Carlo algorithms as communication channels 


It may be a helpful perspective, when thinking about speeding up Monte Carlo 
methods, to think about the information that is being communicated. Two 
communications take place when a sample from P(x) is being generated. 

First, the selection of a particular x from P(x) necessarily requires that 
at least log 1/P(x) random bits be consumed. [Recall the use of inverse arith- 
metic coding as a method for generating samples from given distributions 
(section 6.3).] 

Second, the generation of a sample conveys information about P(x) from 
the subroutine that is able to evaluate P*(x) (and from any other subroutines 
that have access to properties of P*(x)). 

Consider a dumb Metropolis method, for example. In a dumb Metropolis 
method, the proposals Q(x’;x) have nothing to do with P(x). Properties 
of P(x) are only involved in the algorithm at the acceptance step, when the 
ratio P*(x’)/P*(x) is computed. The channel from the true distribution P(x) 
to the user who is interested in computing properties of P(x) thus passes 
through a bottleneck: all the information about P is conveyed by the string of 
acceptances and rejections. If $P(\mathbf{x})$ were replaced by a different distribution 
$P_2(\mathbf{x})$, the only way in which this change would have an influence is that the 
string of acceptances and rejections would be changed. I am not aware of much 
use being made of this information-theoretic view of Monte Carlo algorithms, 
but I think it is an instructive viewpoint: if the aim is to obtain information 
about properties of P(x) then presumably it is helpful to identify the channel 
through which this information flows, and maximize the rate of information 
transfer. 


Example 30.4. The information-theoretic viewpoint offers a simple justification 
for the widely-adopted rule of thumb, which states that the parameters of 
a dumb Metropolis method should be adjusted such that the acceptance 
rate is about one half. Let’s call the acceptance history, that is, the 
binary string of accept or reject decisions, a. The information learned 
about P(x) after the algorithm has run for T steps is less than or equal to 
the information content of a, since all information about P is mediated 
by a. And the information content of a is upper-bounded by $T H_2(f)$, 
where f is the acceptance rate. This bound on information acquired 
about P is maximized by setting f = 1/2. 
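
The bound in this example is easy to tabulate. The sketch below assumes only the standard binary entropy function measured in bits; the names `H2` and `info_bound` and the grid of candidate acceptance rates are choices made for this illustration.

```python
import math

def H2(f):
    """Binary entropy of f, in bits."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)

def info_bound(T, f):
    """Upper bound T * H2(f) on the bits in an acceptance history of length T."""
    return T * H2(f)

# Scan acceptance rates 0.01, 0.02, ..., 0.99 for the one maximizing the bound.
best = max((k / 100.0 for k in range(1, 100)), key=H2)
print(best, info_bound(1000, best))
```

The scan picks out an acceptance rate of one half, where the bound equals one bit per iteration.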


Another helpful analogy for a dumb Metropolis method is an evolutionary 
one. Each proposal generates a progeny x’ from the current state x. These two 
individuals then compete with each other, and the Metropolis method uses a 
noisy survival-of-the-fittest rule. If the progeny x’ is fitter than the parent (i.e., 
P*(x’) > P*(x), assuming the Q/Q factor is unity) then the progeny replaces 
the parent. The survival rule also allows less-fit progeny to replace the parent, 
sometimes. Insights about the rate of evolution can thus be applied to Monte 
Carlo methods. 


Exercise 30.5. Let $\mathbf{x} \in \{0, 1\}^G$ and let $P(\mathbf{x})$ be a separable distribution, 

$$P(\mathbf{x}) = \prod_g p(x_g), \tag{30.14}$$

with $p(0) = p_0$ and $p(1) = p_1$, for example $p_1 = 0.1$. Let the proposal 
density of a dumb Metropolis algorithm $Q$ involve flipping a fraction $m$ 
of the $G$ bits in the state $\mathbf{x}$. Analyze how long it takes for the chain to 
converge to the target density as a function of $m$. Find the optimal $m$ 
and deduce how long the Metropolis method must run for. 


Compare the result with the results for an evolving population under 
natural selection found in Chapter 19. 


The insight that the fastest progress that a standard Metropolis method 
can make, in information terms, is about one bit per iteration, gives a strong 
motivation for speeding up the algorithm. This chapter has already reviewed 
several methods for reducing random-walk behaviour. Do these methods also 
speed up the rate at which information is acquired? 


Exercise 30.6.[4] Does Gibbs sampling, which is a smart Metropolis method 
whose proposal distributions do depend on P(x), allow information about 
P(x) to leak out at a rate faster than one bit per iteration? Find toy 
examples in which this question can be precisely investigated. 


Exercise 30.7.[4] Hamiltonian Monte Carlo is another smart Metropolis method 
in which the proposal distributions depend on P(x). Can Hamiltonian 
Monte Carlo extract information about P(x) at a rate faster than one 
bit per iteration? 


Exercise 30.8.[5] In importance sampling, the weight $w_r = P^*(\mathbf{x}^{(r)})/Q^*(\mathbf{x}^{(r)})$, 
a floating-point number, is computed and retained until the end of the 
computation. In contrast, in the dumb Metropolis method, the ratio 
a = P*(x’)/P*(x) is reduced to a single bit (‘is a bigger than or smaller 
than the random number w?’). Thus in principle importance sampling 
preserves more information about P* than does dumb Metropolis. Can 
you find a toy example in which this extra information does indeed lead 
to faster convergence of importance sampling than Metropolis? Can 
you design a Markov chain Monte Carlo algorithm that moves around 
adaptively, like a Metropolis method, and that retains more useful in- 
formation about the value of P*, like importance sampling? 


In Chapter 19 we noticed that an evolving population of N individuals can 
make faster evolutionary progress if the individuals engage in sexual reproduc- 
tion. This observation motivates looking at Monte Carlo algorithms in which 
multiple parameter vectors x are evolved and interact. 


> 30.6 Multi-state methods 


In a multi-state method, multiple parameter vectors x are maintained; they 
evolve individually under moves such as Metropolis and Gibbs; there are also 
interactions among the vectors. The intention is either that eventually all the 
vectors x should be samples from P(x) (as illustrated by Skilling’s leapfrog 
method), or that information associated with the final vectors x should allow 
us to approximate expectations under P(x), as in importance sampling. 


Genetic methods 


Genetic algorithms are not often described by their proponents as Monte Carlo 
algorithms, but I think this is the correct categorization, and an ideal genetic 
algorithm would be one that can be proved to be a valid Monte Carlo algorithm 
that converges to a specified density. 

I’ll use $R$ to denote the number of vectors in the population. We aim to 
have $P^*(\{\mathbf{x}^{(r)}\}_1^R) = \prod_r P^*(\mathbf{x}^{(r)})$. A genetic algorithm involves moves of two or 
three types. 

First, individual moves in which one state vector is perturbed, $\mathbf{x}^{(r)} \to \mathbf{x}^{(r)\prime}$, 
which could be performed using any of the Monte Carlo methods we have 
mentioned so far. 

Second, we allow crossover moves of the form $\mathbf{x}, \mathbf{y} \to \mathbf{x}', \mathbf{y}'$; in a typical 
crossover move, the progeny $\mathbf{x}'$ receives half his state vector from one parent, 
$\mathbf{x}$, and half from the other, $\mathbf{y}$; the secret of success in a genetic algorithm is 
that the parameter $\mathbf{x}$ must be encoded in such a way that the crossover of 
two independent states $\mathbf{x}$ and $\mathbf{y}$, both of which have good fitness $P^*$, should 
have a reasonably good chance of producing progeny who are equally fit. This 
constraint is a hard one to satisfy in many problems, which is why genetic 
algorithms are mainly talked about and hyped up, and rarely used by serious 
experts. Having introduced a crossover move $\mathbf{x}, \mathbf{y} \to \mathbf{x}', \mathbf{y}'$, we need to choose 
an acceptance rule. One easy way to obtain a valid algorithm is to accept or 
reject the crossover proposal using the Metropolis rule with $P^*(\{\mathbf{x}^{(r)}\}_1^R)$ as 
the target density; this involves comparing the fitnesses before and after the 
crossover using the ratio 

$$\frac{P^*(\mathbf{x}') P^*(\mathbf{y}')}{P^*(\mathbf{x}) P^*(\mathbf{y})}. \tag{30.15}$$

If the crossover operator is reversible then we have an easy proof that this 
procedure satisfies detailed balance and so is a valid component in a chain 
converging to $P^*(\{\mathbf{x}^{(r)}\}_1^R)$. 
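
A crossover move with this acceptance rule might be sketched as below. The single-cut-point crossover, the function names, and the use of a user-supplied `log_p_star` (the logarithm of the unnormalized density $P^*$) are assumptions made for illustration; the cut point is drawn independently of the states, and applying the same crossover twice restores the parents, so the proposal is reversible in the sense required above.

```python
import math
import random

def crossover(x, y, cut):
    """Swap the tails of x and y after position cut; applying it twice undoes it."""
    return x[:cut] + y[cut:], y[:cut] + x[cut:]

def crossover_move(x, y, log_p_star, rng=random):
    """Propose a crossover and accept it using the ratio of equation (30.15)."""
    cut = rng.randrange(1, len(x))        # requires vectors of length at least 2
    x_new, y_new = crossover(x, y, cut)
    log_ratio = (log_p_star(x_new) + log_p_star(y_new)
                 - log_p_star(x) - log_p_star(y))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return x_new, y_new
    return x, y
```

For a separable $P^*$ the ratio is exactly one, so every crossover is accepted; the move only becomes selective when the components of the state vector interact.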


> Exercise 30.9.19] Discuss whether the above two operators, individual varia- 
tion and crossover with the Metropolis acceptance rule, will give a more 
efficient Monte Carlo method than a standard method with only one 
state vector and no crossover. 


The reason why the sexual community could acquire information faster than 
the asexual community in Chapter 19 was because the crossover operation 
produced diversity with standard deviation $\sqrt{G}$, then the Blind Watchmaker 
was able to convey lots of information about the fitness function by killing 
off the less fit offspring. The above two operators do not offer a speed-up of 
$\sqrt{G}$ compared with standard Monte Carlo methods because there is no killing. 
What’s required, in order to obtain a speed-up, is two things: multiplication 
and death; and at least one of these must operate selectively. Either we must 
kill off the less-fit state vectors, or we must allow the more-fit state vectors to 
give rise to more offspring. While it’s easy to sketch these ideas, it is hard to 
define a valid method for doing it. 


Exercise 30.10. Design a birth rule and a death rule such that the chain 
converges to $P^*(\{\mathbf{x}^{(r)}\}_1^R)$. 


I believe this is still an open research problem. 


Particle filters 


Particle filters, which are particularly popular in inference problems involving 
temporal tracking, are multistate methods that mix the ideas of importance 
sampling and Markov chain Monte Carlo. See Isard and Blake (1996), Isard 
and Blake (1998), Berzuini et al. (1997), Berzuini and Gilks (2001), Doucet 
et al. (2001). 


> 30.7 Methods that do not necessarily help 


It is common practice to use many initial conditions for a particular Markov 
chain (figure 29.19). If you are worried about sampling well from a complicated 
density P(x), can you ensure the states produced by the simulations are well 
distributed about the typical set of P(x) by ensuring that the initial points 
are ‘well distributed about the whole state space’? 

The answer is, unfortunately, no. In hierarchical Bayesian models, for
example, a large number of parameters {x_n} may be coupled together via
another parameter β (known as a hyperparameter). For example, the quantities
{x_n} might be independent noise signals, and β might be the inverse-variance
of the noise source. The joint distribution of β and {x_n} might be

    P(\beta, \{x_n\}) = P(\beta) \prod_{n=1}^{N} P(x_n \mid \beta)
                      = P(\beta) \prod_{n=1}^{N} \frac{1}{Z(\beta)} e^{-\beta x_n^2/2},

where Z(β) = √(2π/β) and P(β) is a broad distribution describing our
ignorance about the noise level. For simplicity, let’s leave out all the other
variables (data and such) that might be involved in a realistic problem. Let’s
imagine that we want to sample effectively from P(β, {x_n}) by Gibbs sampling:
alternately sampling β from the conditional distribution P(β | {x_n}), then
sampling all the x_n from their conditional distributions P(x_n | β). [The
resulting marginal distribution of β should asymptotically be the broad
distribution P(β).]

If N is large then the conditional distribution of β given any particular
setting of {x_n} will be tightly concentrated on a particular most-probable
value of β, with width proportional to 1/√N. Progress up and down the β-axis
will therefore take place by a slow random walk with steps of size ∝ 1/√N.

So, to the initialization strategy. Can we finesse our slow convergence
problem by using initial conditions located ‘all over the state space’? Sadly,
no. If we distribute the points {x_n} widely, what we are actually doing is
favouring an initial value of the noise level 1/β that is large. The random
walk of the parameter β will thus tend, after the first drawing of β from
P(β | {x_n}), always to start off from one end of the β-axis.
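The slow random walk can be seen in a small experiment. The following is a minimal sketch, not the author's code: it assumes a broad Gamma prior on β (shape `a0`, rate `b0`, both hypothetical choices), for which the conditional distribution of β given {x_n} is again a Gamma distribution. The function names are illustrative only.

```python
import math
import random


def gibbs_beta_trace(n_signals, n_sweeps, beta0=1.0, a0=0.01, b0=0.01, seed=0):
    """Gibbs-sample (beta, {x_n}) and return the visited values of log(beta).

    Assumed prior (not from the text): beta ~ Gamma(shape=a0, rate=b0).
    """
    rng = random.Random(seed)
    beta = beta0
    trace = [math.log(beta)]
    for _ in range(n_sweeps):
        # Sample every x_n from P(x_n | beta) = Normal(0, variance 1/beta).
        sigma = 1.0 / math.sqrt(beta)
        sum_sq = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n_signals))
        # Sample beta from P(beta | {x_n}) = Gamma(a0 + N/2, rate b0 + sum_sq/2).
        shape = a0 + 0.5 * n_signals
        rate = b0 + 0.5 * sum_sq
        beta = rng.gammavariate(shape, 1.0 / rate)  # second argument is a scale
        trace.append(math.log(beta))
    return trace


def rms_step(trace):
    """Root-mean-square change in log(beta) between successive sweeps."""
    steps = [b - a for a, b in zip(trace, trace[1:])]
    return math.sqrt(sum(s * s for s in steps) / len(steps))
```

Comparing `rms_step(gibbs_beta_trace(10, 400))` with `rms_step(gibbs_beta_trace(1000, 400))` should show the step size in log β shrinking roughly in proportion to 1/√N, so the larger model explores the β-axis much more slowly.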


Further reading 


The Hamiltonian Monte Carlo method (Duane et al., 1987) is reviewed in Neal 
(1993b). This excellent tome also reviews a huge range of other Monte Carlo 
methods, including the related topics of simulated annealing and free energy 
estimation. 


> 30.8 Further exercises 


Exercise 30.11.[4] An important detail of the Hamiltonian Monte Carlo method
is that the simulation of the Hamiltonian dynamics, while it may be in- 
accurate, must be perfectly reversible, in the sense that if the initial con- 
dition (x, p) goes to (x’, p’), then the same simulator must take (x’, —p’) 
to (x,—p), and the inaccurate dynamics must conserve state-space vol- 
ume. [The leapfrog method in algorithm 30.1 satisfies these rules.] 


Explain why these rules must be satisfied and create an example illus- 
trating the problems that arise if they are not. 
Exercise 30.12.[4] A multi-state idea for slice sampling. Investigate the
    following multi-state method for slice sampling. As in Skilling’s
    multi-state leapfrog method (section 30.4), maintain a set of S state
    vectors {x^(s)}. Update one state vector x^(s) by one-dimensional slice
    sampling in a direction y determined by picking two other state vectors
    x^(v) and x^(w) at random and setting y = x^(v) − x^(w). Investigate this
    method on toy problems such as a highly-correlated multivariate Gaussian
    distribution.

    Bear in mind that if S − 1 is smaller than the number of dimensions
    N then this method will not be ergodic by itself, so it may need to be
    mixed with other methods. Are there classes of problems that are better
    solved by this slice-sampling method than by the standard methods for
    picking y such as cycling through the coordinate axes or picking y at
    random from a Gaussian distribution?


> 30.9 Solutions 


Solution to exercise 30.3 (p.393). Consider the spherical Gaussian distribution
where all components have mean zero and variance 1. In one dimension, the
nth, if x_n^(1) leapfrogs over x_n^(2), we obtain the proposed coordinate

    (x_n^{(1)})' = 2 x_n^{(2)} - x_n^{(1)}.     (30.16)

Assuming that x_n^(1) and x_n^(2) are Gaussian random variables from
Normal(0, 1), (x_n^(1))' is Gaussian from Normal(0, σ²), where
σ² = 2² + (−1)² = 5. The change in energy contributed by this one dimension
will be

    \frac{1}{2} \left[ (2 x_n^{(2)} - x_n^{(1)})^2 - (x_n^{(1)})^2 \right]
        = 2 (x_n^{(2)})^2 - 2 x_n^{(2)} x_n^{(1)}     (30.17)

so the typical change in energy is 2⟨(x_n^(2))²⟩ = 2. This positive change is
bad news. In N dimensions, the typical change in energy when a leapfrog move is
made, at equilibrium, is thus +2N. The probability of acceptance of the move
scales as

    e^{-2N}.     (30.18)


This implies that Skilling’s method, as described, is not effective in very high- 
dimensional problems — at least, not once convergence has occurred. Nev- 
ertheless it has the impressive advantage that its convergence properties are 
independent of the strength of correlations between the variables — a property 
that not even the Hamiltonian Monte Carlo and overrelaxation methods offer. 
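The expected energy change of 2 per dimension can be checked numerically. This is a small illustrative sketch (the function name and sample size are arbitrary choices, not from the text) that draws both coordinates from Normal(0, 1), applies the proposal (30.16), and averages the energy change (30.17).

```python
import random


def mean_leapfrog_energy_change(n_samples=200_000, seed=0):
    """Monte Carlo estimate of the mean energy change per dimension, eq. (30.17)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x1 = rng.gauss(0.0, 1.0)   # coordinate of the state that leaps
        x2 = rng.gauss(0.0, 1.0)   # coordinate of the state leapt over
        x1_new = 2.0 * x2 - x1     # proposal (30.16)
        total += 0.5 * (x1_new ** 2 - x1 ** 2)  # energy is x^2/2 per dimension
    return total / n_samples
```

The estimate should come out close to 2, so that N independent dimensions contribute about 2N in total.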
About Chapter 31


Some of the neural network models that we will encounter are related to Ising 
models, which are idealized magnetic systems. It is not essential to understand 
the statistical physics of Ising models to understand these neural networks, but 
I hope you'll find them helpful. 

Ising models are also related to several other topics in this book. We will 
use exact tree-based computation methods like those introduced in Chapter 
25 to evaluate properties of interest in Ising models. Ising models offer crude 
models for binary images. And Ising models relate to two-dimensional con- 
strained channels (cf. Chapter 17): a two-dimensional bar-code in which a 
black dot may not be completely surrounded by black dots, and a white dot 
may not be completely surrounded by white dots, is similar to an antiferro- 
magnetic Ising model at low temperature. Evaluating the entropy of this Ising 
model is equivalent to evaluating the capacity of the constrained channel for 
conveying bits. 

If you would like to jog your memory on statistical physics and thermody- 
namics, you might find Appendix B helpful. I also recommend the book by 
Reif (1965). 
Ising Models


An Ising model is an array of spins (e.g., atoms that can take states ±1) that
are magnetically coupled to each other. If one spin is, say, in the +1 state
then it is energetically favourable for its immediate neighbours to be in the
same state, in the case of a ferromagnetic model, and in the opposite state, in
the case of an antiferromagnet. In this chapter we discuss two computational
techniques for studying Ising models.

Let the state x of an Ising model with N spins be a vector in which each
component x_n takes values −1 or +1. If two spins m and n are neighbours we
write (m, n) ∈ \mathcal{N}. The coupling between neighbouring spins is J. We
define J_mn = J if m and n are neighbours and J_mn = 0 otherwise. The energy
of a state x is

    E(x; J, H) = -\left[ \frac{1}{2} \sum_{m,n} J_{mn} x_m x_n + \sum_n H x_n \right],     (31.1)


where H is the applied field. If J > 0 then the model is ferromagnetic, and 
if J < 0 it is antiferromagnetic. We’ve included the factor of 1/2 because each 
pair is counted twice in the first sum, once as (m,n) and once as (n,m). At 
equilibrium at temperature T, the probability that the state is x is 


    P(x \mid \beta, J, H) = \frac{1}{Z(\beta, J, H)} \exp[-\beta E(x; J, H)],     (31.2)

where β = 1/k_B T, k_B is Boltzmann’s constant, and

    Z(\beta, J, H) \equiv \sum_{x} \exp[-\beta E(x; J, H)].     (31.3)
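For very small systems the definitions (31.1) and (31.3) can be evaluated by brute force. The sketch below assumes a rectangular grid with periodic boundaries and at least three rows and columns (so that no pair of neighbours is listed twice); the helper names are hypothetical.

```python
import itertools
import math


def torus_bonds(rows, cols):
    """List each nearest-neighbour pair of a periodic rows x cols grid once."""
    bonds = []
    for r in range(rows):
        for c in range(cols):
            site = r * cols + c
            bonds.append((site, r * cols + (c + 1) % cols))    # right neighbour
            bonds.append((site, ((r + 1) % rows) * cols + c))  # neighbour below
    return bonds


def energy(x, bonds, J=1.0, H=0.0):
    """E(x; J, H) of (31.1); summing over unordered bonds absorbs the 1/2."""
    return -(J * sum(x[m] * x[n] for m, n in bonds) + H * sum(x))


def partition_function(rows, cols, beta, J=1.0, H=0.0):
    """Z of (31.3) by explicit enumeration of all 2^N states (small N only)."""
    bonds = torus_bonds(rows, cols)
    return sum(math.exp(-beta * energy(x, bonds, J, H))
               for x in itertools.product((-1, 1), repeat=rows * cols))
```

The cost grows as 2^N, which is why the rest of the chapter turns to Monte Carlo simulation and to the transfer matrix method.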


Relevance of Ising models 


Ising models are relevant for three reasons. 

Ising models are important first as models of magnetic systems that have 
a phase transition. The theory of universality in statistical physics shows that 
all systems with the same dimension (here, two), and the same symmetries, 
have equivalent critical properties, i.e., the scaling laws shown by their phase 
transitions are identical. So by studying Ising models we can find out not only 
about magnetic phase transitions but also about phase transitions in many 
other systems. 

Second, if we generalize the energy function to

    E(x; J, h) = -\left[ \frac{1}{2} \sum_{m,n} J_{mn} x_m x_n + \sum_n h_n x_n \right],     (31.4)

where the couplings J_mn and applied fields h_n are not constant, we obtain
a family of models known as ‘spin glasses’ to physicists, and as ‘Hopfield
networks’ or ‘Boltzmann machines’ to the neural network community. In some
of these models, all spins are declared to be neighbours of each other, in which
case physicists call the system an ‘infinite-range’ spin glass, and networkers
call it a ‘fully connected’ network.
Third, the Ising model is also useful as a statistical model in its own right. 
In this chapter we will study Ising models using two different computational 
techniques. 


Some remarkable relationships in statistical physics 


We would like to get as much information as possible out of our computations. 
Consider for example the heat capacity of a system, which is defined to be 


    C \equiv \frac{\partial \bar{E}}{\partial T},     (31.5)

where

    \bar{E} = \frac{1}{Z} \sum_{x} \exp(-\beta E(x)) \, E(x).     (31.6)


To work out the heat capacity of a system, we might naively guess that we have 
to increase the temperature and measure the energy change. Heat capacity, 
however, is intimately related to energy fluctuations at constant temperature. 
Let’s start from the partition function, 


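    Z = \sum_{x} \exp(-\beta E(x)).     (31.7)

The mean energy is obtained by differentiation with respect to β:

    \frac{\partial \ln Z}{\partial \beta} = \frac{1}{Z} \sum_{x} -E(x) \exp(-\beta E(x)) = -\bar{E}.     (31.8)

A further differentiation spits out the variance of the energy:

    \frac{\partial^2 \ln Z}{\partial \beta^2} = \frac{1}{Z} \sum_{x} E(x)^2 \exp(-\beta E(x)) - \bar{E}^2
        = \langle E^2 \rangle - \bar{E}^2 = \mathrm{var}(E).     (31.9)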


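But the heat capacity is also the derivative of Ē with respect to temperature:

    \frac{\partial \bar{E}}{\partial T}
        = -\frac{\partial}{\partial T} \frac{\partial \ln Z}{\partial \beta}
        = -\frac{\partial^2 \ln Z}{\partial \beta^2} \frac{\partial \beta}{\partial T}
        = -\mathrm{var}(E) \, (-1/k_B T^2).     (31.10)

So for any system at temperature T,

    C = \frac{\mathrm{var}(E)}{k_B T^2} = k_B \beta^2 \, \mathrm{var}(E).     (31.11)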


Thus if we can observe the variance of the energy of a system at equilibrium, 
we can estimate its heat capacity. 
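Relation (31.11) is easy to confirm numerically for any finite list of energy levels. The following sketch, with hypothetical function names and k_B = 1, compares a central-difference estimate of dĒ/dT with var(E)/T².

```python
import math


def mean_and_variance(energies, beta):
    """Boltzmann mean and variance of the energy over a finite list of states."""
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)
    mean = sum(w * e for w, e in zip(weights, energies)) / Z
    mean_sq = sum(w * e * e for w, e in zip(weights, energies)) / Z
    return mean, mean_sq - mean ** 2


def heat_capacity_two_ways(energies, T, dT=1e-4):
    """Return (dE/dT by central difference, var(E)/T^2), with k_B = 1."""
    e_hi, _ = mean_and_variance(energies, 1.0 / (T + dT))
    e_lo, _ = mean_and_variance(energies, 1.0 / (T - dT))
    _, var = mean_and_variance(energies, 1.0 / T)
    return (e_hi - e_lo) / (2.0 * dT), var / T ** 2
```

For example, `heat_capacity_two_ways([0.0, 1.0, 1.0, 2.5, 4.0], 2.0)` should return two numbers that agree to within the finite-difference error.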

I find this an almost paradoxical relationship. Consider a system with 
a finite set of states, and imagine heating it up. At high temperature, all 
states will be equiprobable, so the mean energy will be essentially constant 
and the heat capacity will be essentially zero. But on the other hand, with 
all states being equiprobable, there will certainly be fluctuations in energy. 
So how can the heat capacity be related to the fluctuations? The answer is 
in the words ‘essentially zero’ above. The heat capacity is not quite zero at
high temperature, it just tends to zero. And it tends to zero as
var(E)/(k_B T²), with the quantity var(E) tending to a constant at high
temperatures. This 1/T² behaviour of the heat capacity of finite systems at
high temperatures is thus very general.

The 1/T² factor can be viewed as an accident of history. If only
temperature scales had been defined using β = 1/(k_B T), then the definition
of heat capacity would be

    C^{(\beta)} \equiv -\frac{\partial \bar{E}}{\partial \beta} = \mathrm{var}(E),     (31.12)


and heat capacity and fluctuations would be identical quantities. 


> Exercise 31.1.[2] [We will call the entropy of a physical system S rather than
    H, while we are in a statistical physics chapter; we set k_B = 1.]

    The entropy of a system whose states are x, at temperature T = 1/β, is

        S = \sum_{x} p(x) \ln[1/p(x)]     (31.13)

    where

        p(x) = \frac{1}{Z(\beta)} \exp[-\beta E(x)].     (31.14)

    (a) Show that

        S = \ln Z(\beta) + \beta \bar{E}(\beta)     (31.15)

    where Ē(β) is the mean energy of the system.

    (b) Show that

        S = -\frac{\partial F}{\partial T},     (31.16)

    where the free energy F = −kT ln Z and kT = 1/β.


> 31.1 Ising models — Monte Carlo simulation 


In this section we study two-dimensional planar Ising models using a simple 
Gibbs-sampling method. Starting from some initial state, a spin n is selected 
at random, and the probability that it should be +1 given the state of the 
other spins and the temperature is computed, 


    P(+1 \mid b_n) = \frac{1}{1 + \exp(-2 \beta b_n)},     (31.17)

where β = 1/k_B T and b_n is the local field

    b_n = \sum_{m:(m,n) \in \mathcal{N}} J x_m + H.     (31.18)


[The factor of 2 appears in equation (31.17) because the two spin states are
{+1, −1} rather than {+1, 0}.] Spin n is set to +1 with that probability,
and otherwise to −1; then the next spin to update is selected at random.
After sufficiently many iterations, this procedure converges to the equilibrium
distribution (31.2). An alternative to the Gibbs sampling formula (31.17) is
the Metropolis algorithm, in which we consider the change in energy that
results from flipping the chosen spin from its current state x_n,

    \Delta E = 2 x_n b_n,     (31.19)

and adopt this change in configuration with probability

    P(\text{accept}; \Delta E, \beta) =
        \begin{cases} 1 & \Delta E \le 0 \\ \exp(-\beta \Delta E) & \Delta E > 0. \end{cases}     (31.20)

This procedure has roughly double the probability of accepting energetically
unfavourable moves, so may be a more efficient sampler, but at very low
temperatures the relative merits of Gibbs sampling and the Metropolis algorithm
may be subtle.
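The two update rules can be compared directly as functions of ΔE. In this sketch (function names are illustrative), the Gibbs rule (31.17) is rewritten as the probability that the chosen spin ends up in the state opposite to its current one, which is 1/(1 + exp(βΔE)) with ΔE from (31.19).

```python
import math


def gibbs_flip_probability(delta_E, beta):
    """Probability that a Gibbs update (31.17) leaves the spin flipped.

    For a spin x_n in local field b_n, the flipped state has probability
    1 / (1 + exp(beta * delta_E)), where delta_E = 2 x_n b_n as in (31.19).
    """
    return 1.0 / (1.0 + math.exp(beta * delta_E))


def metropolis_flip_probability(delta_E, beta):
    """Acceptance probability (31.20) of the Metropolis rule."""
    return 1.0 if delta_E <= 0 else math.exp(-beta * delta_E)
```

The ratio of the Metropolis probability to the Gibbs probability is 1 + exp(−β|ΔE|): close to 2 when β|ΔE| is small, and close to 1 when β|ΔE| is large, which is one way to see why the comparison becomes subtle at low temperature.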


Rectangular geometry


I first simulated an Ising model with the rectangular geometry shown in
figure 31.1, and with periodic boundary conditions. A line between two spins
indicates that they are neighbours. I set the external field H = 0 and
considered the two cases J = ±1, which are a ferromagnet and antiferromagnet
respectively.

Figure 31.1. Rectangular Ising model.


I started at a large temperature (T = 33, β = 0.03) and changed the
temperature every I iterations, first decreasing it gradually to T = 0.1,
β = 10, then increasing it gradually back to a large temperature again. This
procedure gives a crude check on whether ‘equilibrium has been reached’ at each
temperature; if not, we’d expect to see some hysteresis in the graphs we plot.
It also gives an idea of the reproducibility of the results, if we assume that
the two runs, with decreasing and increasing temperature, are effectively
independent of each other.

At each temperature I recorded the mean energy per spin and the standard
deviation of the energy, and the mean square value of the magnetization m,

    m = \frac{1}{N} \sum_{n=1}^{N} x_n.     (31.21)

One tricky decision that has to be made is how soon to start taking these
measurements after a new temperature has been established; it is difficult to
detect ‘equilibrium’, or even to give a clear definition of a system’s being
‘at equilibrium’! [But in Chapter 32 we will see a solution to this problem.]
My crude strategy was to let the number of iterations at each temperature, I,
be a few hundred times the number of spins N, and to discard the first 1/3 of
those iterations. With N = 100, I found I needed more than 100 000 iterations
to reach equilibrium at any given temperature.


Results for small N with J = 1. 


I simulated an l × l grid for l = 4, 5, ..., 10, 40, 64. Let’s have a quick
think about what results we expect. At low temperatures the system is expected
to be in a ground state. The rectangular Ising model with J = 1 has two
ground states, the all +1 state and the all −1 state. The energy per spin of
either ground state is −2. At high temperatures, the spins are independent,
all states are equally probable, and the energy is expected to fluctuate around
a mean of 0 with a standard deviation proportional to 1/√N.

Let’s look at some results. In all figures temperature T is shown with
k_B = 1. The basic picture emerges with as few as 16 spins (figure 31.3,
top): the energy rises monotonically. As we increase the number of spins to
100 (figure 31.3, bottom) some new details emerge. First, as expected, the
fluctuations at large temperature decrease as 1/√N. Second, the fluctuations
at intermediate temperature become relatively bigger. This is the signature
of a ‘collective phenomenon’, in this case, a phase transition. Only systems
with infinite N show true phase transitions, but with N = 100 we are getting
a hint of the critical fluctuations. Figure 31.5 shows details of the graphs for
N = 100 and N = 4096. Figure 31.2 shows a sequence of typical states from
the simulation of N = 4096 spins at a sequence of decreasing temperatures.

Figure 31.2. Sample states of rectangular Ising models with J = 1 at a
sequence of temperatures T.
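The expectations for the two temperature extremes can be reproduced with a very small simulation. The sketch below is an illustration under stated assumptions rather than the code used for the figures: it runs at a single fixed temperature, starts from the all +1 state, measures in units of sweeps (N single-spin updates), and discards the first third of the sweeps.

```python
import math
import random


def simulate(l, T, sweeps, J=1.0, seed=0):
    """Gibbs-sample an l x l periodic Ising model (H = 0, k_B = 1).

    Returns the mean energy per spin over the last two thirds of the sweeps.
    """
    rng = random.Random(seed)
    beta = 1.0 / T
    x = [[1] * l for _ in range(l)]          # start from the all +1 state
    n_spins = l * l

    def local_field(r, c):
        # b_n of (31.18) with H = 0: J times the sum of the four neighbours.
        return J * (x[(r - 1) % l][c] + x[(r + 1) % l][c]
                    + x[r][(c - 1) % l] + x[r][(c + 1) % l])

    def energy_per_spin():
        # Each spin contributes -x_n b_n / 2, since every bond is shared by two spins.
        return -0.5 * sum(x[r][c] * local_field(r, c)
                          for r in range(l) for c in range(l)) / n_spins

    samples = []
    for sweep in range(sweeps):
        for _ in range(n_spins):
            r, c = rng.randrange(l), rng.randrange(l)
            p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * local_field(r, c)))  # (31.17)
            x[r][c] = 1 if rng.random() < p_up else -1
        if sweep >= sweeps // 3:             # discard the first third as burn-in
            samples.append(energy_per_spin())
    return sum(samples) / len(samples)
```

With `l = 4`, a call such as `simulate(4, 0.5, 300)` should stay very close to the ground-state value of −2 per spin, while `simulate(4, 33.0, 300)` should fluctuate around a value near zero.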
Contrast with Schottky anomaly


A peak in the heat capacity, as a function of temperature, occurs in any system 
that has a finite number of energy levels; a peak is not in itself evidence of a 
phase transition. Such peaks were viewed as anomalies in classical thermody- 
namics, since ‘normal’ systems with infinite numbers of energy levels (such as 
a particle in a box) have heat capacities that are either constant or increasing 
functions of temperature. In contrast, systems with a finite number of levels 
produced small blips in the heat capacity graph (figure 31.4). 

Let us refresh our memory of the simplest such system, a two-level system
with states x = 0 (energy 0) and x = 1 (energy ε). The mean energy is

    \bar{E}(\beta) = \epsilon \frac{\exp(-\beta \epsilon)}{1 + \exp(-\beta \epsilon)}
                   = \epsilon \frac{1}{1 + \exp(\beta \epsilon)}     (31.22)

and the derivative with respect to β is

    d\bar{E}/d\beta = -\epsilon^2 \frac{\exp(\beta \epsilon)}{[1 + \exp(\beta \epsilon)]^2}.     (31.23)

So the heat capacity is

    C = d\bar{E}/dT = -\frac{d\bar{E}}{d\beta} \frac{1}{k_B T^2}
      = \frac{\epsilon^2}{k_B T^2} \frac{\exp(\beta \epsilon)}{[1 + \exp(\beta \epsilon)]^2}     (31.24)

and the fluctuations in energy are given by var(E) = C k_B T² = −dĒ/dβ,
which was evaluated in (31.23). The heat capacity and fluctuations are plotted
in figure 31.6. The take-home message at this point is that whilst Schottky
anomalies do have a peak in the heat capacity, there is no peak in their
fluctuations; the variance of the energy simply increases monotonically with
temperature to a value proportional to the number of independent spins. Thus
it is a peak in the fluctuations that is interesting, rather than a peak in the
heat capacity. The Ising model has such a peak in its fluctuations, as can be
seen in the second row of figure 31.5.
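The contrast between the two curves can be tabulated from (31.23) and (31.24). This sketch takes ε = 1 and k_B = 1, as in figure 31.6; the temperature grid and variable names are arbitrary illustrative choices.

```python
import math


def two_level(T, eps=1.0):
    """Return (heat capacity, var(E)) of the two-level system, with k_B = 1."""
    beta = 1.0 / T
    var = eps ** 2 * math.exp(beta * eps) / (1.0 + math.exp(beta * eps)) ** 2
    return var / T ** 2, var   # C = var(E) / T^2, from (31.11)


temperatures = [0.05 * k for k in range(1, 201)]          # T from 0.05 to 10
capacities = [two_level(T)[0] for T in temperatures]
variances = [two_level(T)[1] for T in temperatures]
peak_T = temperatures[capacities.index(max(capacities))]  # location of the blip
```

On this grid the heat capacity has a single interior maximum at a temperature somewhat below ε/2, whereas the variance rises steadily towards its limiting value ε²/4.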


Rectangular Ising model with J = —1 


What do we expect to happen in the case J = −1? The ground states of an
infinite system are the two checkerboard patterns (figure 31.7), and they have
energy per spin −2, like the ground states of the J = 1 model. Can this analogy
be pressed further? A moment’s reflection will confirm that the two systems
are equivalent to each other under a checkerboard symmetry operation. If you
take an infinite J = 1 system in some state and flip all the spins that lie on
the black squares of an infinite checkerboard, and set J = −1 (figure 31.8),
then the energy is unchanged. (The magnetization changes, of course.) So all
thermodynamic properties of the two systems are expected to be identical in
the case of zero applied field.

Figure 31.3. Monte Carlo simulations of rectangular Ising models with J = 1.
Mean energy and fluctuations in energy as a function of temperature (left).
Mean square magnetization as a function of temperature (right). In the top
row, N = 16, and the bottom, N = 100. For even larger N, see later figures.

Figure 31.4. Schematic diagram to explain the meaning of a Schottky anomaly.
The curve shows the heat capacity of two gases as a function of temperature.
The lower curve shows a normal gas whose heat capacity is an increasing
function of temperature. The upper curve has a small peak in the heat capacity,
which is known as a Schottky anomaly (at least in Cambridge). The peak is
produced by the gas having magnetic degrees of freedom with a finite number of
accessible states.

Figure 31.5. Detail of Monte Carlo simulations of rectangular Ising models
with J = 1. (a) Mean energy and fluctuations in energy as a function of
temperature. (b) Fluctuations in energy (standard deviation). (c) Mean square
magnetization. (d) Heat capacity.

Figure 31.6. Schottky anomaly: heat capacity and fluctuations in energy as a
function of temperature for a two-level system with separation ε = 1 and
k_B = 1.

But there is a subtlety lurking here. Have you spotted it? We are simulating
finite grids with periodic boundary conditions. If the size of the grid in
any direction is odd, then the checkerboard operation is no longer a symmetry
operation relating J = +1 to J = −1, because the checkerboard doesn’t
match up at the boundaries. This means that for systems of odd size, the
ground state of a system with J = −1 will have degeneracy greater than 2,
and the energy of those ground states will not be as low as −2 per spin. So we
expect qualitative differences between the cases J = ±1 in odd-sized systems.
These differences are expected to be most prominent for small systems. The
frustrations are introduced by the boundaries, and the length of the boundary
grows as the square root of the system size, so the fractional influence of this
boundary-related frustration on the energy and entropy of the system will
decrease as 1/√N. Figure 31.9 compares the energies of the ferromagnetic and
antiferromagnetic models with N = 25. Here, the difference is striking.
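The effect of odd size on the ground states can be checked by exhaustive enumeration on tiny grids. The sketch below (a hypothetical helper, with H = 0 and integer J so that energies compare exactly) returns the minimum energy per spin and the number of states that attain it.

```python
import itertools


def ground_states(l, J):
    """Return (minimum energy per spin, degeneracy) of an l x l periodic grid."""
    n_spins = l * l
    bonds = []
    for r in range(l):
        for c in range(l):
            bonds.append((r * l + c, r * l + (c + 1) % l))      # right neighbour
            bonds.append((r * l + c, ((r + 1) % l) * l + c))    # neighbour below
    best, count = None, 0
    for x in itertools.product((-1, 1), repeat=n_spins):
        e = -J * sum(x[a] * x[b] for a, b in bonds)
        if best is None or e < best:
            best, count = e, 1
        elif e == best:
            count += 1
    return best / n_spins, count
```

For the 3 × 3 grid, `ground_states(3, 1)` gives energy −2 per spin with two ground states, whereas `ground_states(3, -1)` gives a higher energy per spin and many more than two ground states. For the even-sized 4 × 4 grid the antiferromagnet recovers −2 per spin and a degeneracy of 2 (this case enumerates 65 536 states, so it takes a moment).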
Triangular Ising model


We can repeat these computations for a triangular Ising model. Do we expect
the triangular Ising model with J = ±1 to show different physical properties
from the rectangular Ising model? Presumably the J = 1 model will have
broadly similar properties to its rectangular counterpart. But the case J = −1
is radically different from what’s gone before. Think about it: there is no 
unfrustrated ground state; in any state, there must be frustrations — pairs of 
neighbours who have the same sign as each other. Unlike the case of the 
rectangular model with odd size, the frustrations are not introduced by the 
periodic boundary conditions. Every set of three mutually neighbouring spins 
must be in a state of frustration, as shown in figure 31.10. (Solid lines show 
‘happy’ couplings which contribute —|J| to the energy; dashed lines show 
‘unhappy’ couplings which contribute |J|.) Thus we certainly expect different 
behaviour at low temperatures. In fact we might expect this system to have 
a non-zero entropy at absolute zero. (‘Triangular model violates third law of 
thermodynamics!’) 
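The frustration of a single triangle can be verified by listing its eight states. This short sketch (the function name is illustrative) counts how many states have each energy when the three spins are mutually coupled with strength J and H = 0.

```python
import itertools
from collections import Counter


def triangle_energies(J=-1):
    """Count the energies of the 8 states of three mutually neighbouring spins."""
    counts = Counter()
    for x1, x2, x3 in itertools.product((-1, 1), repeat=3):
        counts[-J * (x1 * x2 + x2 * x3 + x3 * x1)] += 1
    return counts
```

For J = −1 the result is six states of energy −|J| and two of energy 3|J|: the best the antiferromagnet can do is satisfy two of the three couplings, and it can do so in six different ways.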

Let’s look at some results. Sample states are shown in figure 31.12, and 
figure 31.11 shows the energy, fluctuations, and heat capacity for N = 4096. 
Note how different the results for J = ±1 are. There is no peak at all in
the standard deviation of the energy in the case J = −1. This indicates that
the antiferromagnetic system does not have a phase transition to a state with
long-range order.

Figure 31.7. The two ground states of a rectangular Ising model with J = −1.

Figure 31.8. Two states of rectangular Ising models with J = ±1 that have
identical energy.

Figure 31.9. Monte Carlo simulations of rectangular Ising models with J = ±1
and N = 25. Mean energy and fluctuations in energy as a function of
temperature.

Figure 31.10. In an antiferromagnetic triangular Ising model, any three
neighbouring spins are frustrated. Of the eight possible configurations of
three spins, six have energy −|J| (a), and two have energy 3|J| (b).
> 31.2 Direct computation of partition function of Ising models


We now examine a completely different approach to Ising models. The trans- 
fer matrix method is an exact and abstract approach that obtains physical 
properties of the model from the partition function 


    Z(\beta, J, b) \equiv \sum_{x} \exp[-\beta E(x; J, b)],     (31.25)

where the summation is over all states x, and the inverse temperature is
β = 1/T. [As usual, let k_B = 1.] The free energy is given by
F = −(1/β) ln Z. The number of states is 2^N, so direct computation of the
partition function is not possible for large N. To avoid enumerating all global
states explicitly, we can use a trick similar to the sum-product algorithm
discussed in Chapter 25. We concentrate on models that have the form of a long
thin strip of width W with periodic boundary conditions in both directions, and
we iterate along the length of our model, working out a set of partial
partition functions at one location l in terms of partial partition functions
at the previous location l − 1. Each iteration involves a summation over all
the states at the boundary. This operation is exponential in the width of the
strip, W. The final clever trick is to note that if the system is
translation-invariant along its length then we need to do only one iteration in
order to find the properties of a system of any length.

Figure 31.12. Sample states of triangular Ising models with J = 1 and J = −1.
High temperatures at the top; low at the bottom.

The computational task becomes the evaluation of an S x S matrix, where 
S is the number of microstates that need to be considered at the boundary, 
and the computation of its eigenvalues. The eigenvalue of largest magnitude 
gives the partition function for an infinite-length thin strip. 

Here is a more detailed explanation. Label the states of the C columns of
the thin strip s_1, s_2, ..., s_C, with each s an integer from 0 to 2^W − 1.
The rth bit of s_c indicates whether the spin in row r, column c is up or
down. The partition function is

    Z = \sum_{x} \exp(-\beta E(x))     (31.26)
      = \sum_{s_1} \sum_{s_2} \cdots \sum_{s_C}
        \exp\left( -\beta \sum_{c=1}^{C} \mathcal{E}(s_c, s_{c+1}) \right),     (31.27)

where \mathcal{E}(s_c, s_{c+1}) is an appropriately defined energy, and, if we
want periodic boundary conditions, s_{C+1} is defined to be s_1. One definition
for \mathcal{E} is:

    \mathcal{E}(s_c, s_{c+1}) =
          \sum_{(m,n) \in \mathcal{N}:\; m \in c,\, n \in c+1} J x_m x_n
        + \frac{1}{4} \sum_{(m,n) \in \mathcal{N}:\; m \in c,\, n \in c} J x_m x_n
        + \frac{1}{4} \sum_{(m,n) \in \mathcal{N}:\; m \in c+1,\, n \in c+1} J x_m x_n.     (31.28)

This definition of the energy has the nice property that (for the rectangular
Ising model) it defines a matrix that is symmetric in its two indices
s_c, s_{c+1}. The factors of 1/4 are needed because vertical links are counted
four times. Let us define

    M_{ss'} = \exp(-\beta \mathcal{E}(s, s')).     (31.29)

Then continuing from equation (31.27),

    Z = \sum_{s_1} \sum_{s_2} \cdots \sum_{s_C} \left[ \prod_{c=1}^{C} M_{s_c, s_{c+1}} \right]     (31.30)
      = \mathrm{Trace}\left[ M^C \right]     (31.31)
      = \sum_{a} \mu_a^C,     (31.32)

where {μ_a}, a = 1, ..., 2^W, are the eigenvalues of M. As the length of the
strip C increases, Z becomes dominated by the largest eigenvalue μ_max:

    Z \to \mu_{\max}^C.     (31.33)

So the free energy per spin in the limit of an infinite thin strip is given by:

    f = -kT \ln Z/(WC) = -kT\, C \ln \mu_{\max}/(WC) = -kT \ln \mu_{\max}/W.     (31.34)


It’s really neat that all the thermodynamic properties of a long thin strip can 
be obtained from just the largest eigenvalue of this matrix M! 
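The identity Z = Trace[M^C] can be checked against explicit enumeration on a small strip. The sketch below makes several assumptions that are not taken from the text: it treats the rectangular model only, requires W ≥ 3 and C ≥ 3 so that periodic wrapping never lists a bond twice, and writes the column-pair energy with the overall minus sign of (31.1) so that J > 0 is ferromagnetic. Summing over unordered vertical bonds with weight 1/2 plays the role of the factors of 1/4 applied to the doubly counted sums in (31.28). All function names are hypothetical.

```python
import itertools
import math


def spins(s, W):
    """Decode integer s into a column of W spins; bit r set means spin +1."""
    return [1 if (s >> r) & 1 else -1 for r in range(W)]


def transfer_matrix(W, beta, J=1.0):
    """M[s][t] = exp(-beta * E(s, t)) for a rectangular strip of width W >= 3."""
    cols = [spins(s, W) for s in range(2 ** W)]
    # Vertical bonds inside one column, periodic between top and bottom.
    vertical = [sum(x[r] * x[(r + 1) % W] for r in range(W)) for x in cols]
    M = []
    for s, xs in enumerate(cols):
        row = []
        for t, xt in enumerate(cols):
            horizontal = sum(a * b for a, b in zip(xs, xt))
            # Each column's vertical bonds are shared with the column pair on
            # its other side, so they enter here with weight 1/2.
            e = -J * (horizontal + 0.5 * vertical[s] + 0.5 * vertical[t])
            row.append(math.exp(-beta * e))
        M.append(row)
    return M


def trace_of_power(M, C):
    """Trace of M^C by repeated matrix multiplication."""
    size = len(M)
    P = [row[:] for row in M]
    for _ in range(C - 1):
        P = [[sum(P[i][k] * M[k][j] for k in range(size)) for j in range(size)]
             for i in range(size)]
    return sum(P[i][i] for i in range(size))


def brute_force_Z(W, C, beta, J=1.0):
    """Sum exp(-beta E) over all 2^(W*C) states of the W x C periodic grid."""
    Z = 0.0
    for x in itertools.product((-1, 1), repeat=W * C):
        bond_sum = sum(
            x[c * W + r] * (x[c * W + (r + 1) % W] + x[((c + 1) % C) * W + r])
            for c in range(C) for r in range(W))
        Z += math.exp(beta * J * bond_sum)
    return Z
```

For instance, `trace_of_power(transfer_matrix(3, 0.4), 4)` and `brute_force_Z(3, 4, 0.4)` should agree to rounding error, although the first works with an 8 × 8 matrix and the second sums 4096 terms.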


Computations 


I computed the partition functions of long-thin-strip Ising models with the 
geometries shown in figure 31.14. 

As in the last section, I set the applied field H to zero and considered the
two cases J = ±1, which are a ferromagnet and antiferromagnet respectively. I
computed the free energy per spin, f(β, J, H) = F/N, for widths from W = 2
to 8 as a function of β for H = 0.
Figure 31.13. Illustration to help explain the definition (31.28).
\mathcal{E}(s_2, s_3) counts all the contributions to the energy in the
rectangle. The total energy is given by stepping the rectangle along. Each
horizontal bond inside the rectangle is counted once; each vertical bond is
half-inside the rectangle (and will be half-inside an adjacent rectangle) so
half its energy is included in \mathcal{E}(s_2, s_3); the factor of 1/4 appears
in the second term because m and n both run over all nodes in column c, so each
bond is visited twice.

For the state shown here, s_2 = (100)_2, s_3 = (110)_2, the horizontal bonds
contribute +J to \mathcal{E}(s_2, s_3), and the vertical bonds contribute −J/2
on the left and −J/2 on the right, assuming periodic boundary conditions
between top and bottom. So \mathcal{E}(s_2, s_3) = 0.
Computational ideas:


Only the largest eigenvalue is needed. There are several ways of getting this
quantity, for example, iterative multiplication of the matrix by an initial
vector. Because the matrix is all positive we know that the principal
eigenvector is all positive too (Frobenius-Perron theorem), so a reasonable
initial vector is (1, 1, ..., 1). This iterative procedure may be faster than
explicit computation of all eigenvalues. I computed them all anyway, which has
the advantage that we can find the free energy of finite length strips, using
equation (31.32), as well as infinite ones.
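The iterative multiplication can be sketched in a few lines. To keep the example short and checkable, it is demonstrated on the 2 × 2 transfer matrix of a one-dimensional Ising chain (a width-1 strip with no vertical bonds, which is a simplification and not one of the geometries studied here), whose largest eigenvalue is known to be 2 cosh(βJ). The function names and the stopping rule are illustrative choices.

```python
import math


def largest_eigenvalue(M, tol=1e-12, max_iter=10_000):
    """Power iteration from the all-ones vector for a matrix with positive entries."""
    v = [1.0] * len(M)
    mu = 0.0
    for _ in range(max_iter):
        w = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
        mu_new = max(w)                  # normalizing constant; tends to mu_max
        v = [w_i / mu_new for w_i in w]  # rescale so the largest entry is 1
        if abs(mu_new - mu) <= tol * mu_new:
            return mu_new
        mu = mu_new
    return mu


def chain_transfer_matrix(beta, J=1.0):
    """Transfer matrix of a one-dimensional Ising chain (width-1 strip, H = 0)."""
    a, b = math.exp(beta * J), math.exp(-beta * J)
    return [[a, b], [b, a]]


def chain_free_energy_per_spin(beta, J=1.0):
    """f = -T ln(mu_max) / W from (31.34), here with W = 1 and k = 1."""
    return -math.log(largest_eigenvalue(chain_transfer_matrix(beta, J))) / beta
```

The same `largest_eigenvalue` routine can be applied unchanged to the 2^W × 2^W matrix of a wider strip; only the construction of M differs.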


Comments on graphs:


For large temperatures all Ising models should show the same behaviour: the
free energy is entropy-dominated, and the entropy per spin is ln(2). The mean
energy per spin goes to zero. The free energy per spin should tend to −ln(2)/β.
The free energies are shown in figure 31.15.

One of the interesting properties we can obtain from the free energy is
the degeneracy of the ground state. As the temperature goes to zero, the
Boltzmann distribution becomes concentrated in the ground state. If the
ground state is degenerate (i.e., there are multiple ground states with
identical energy) then the entropy as T → 0 is non-zero. We can find the
entropy from the free energy using S = −∂F/∂T.

Figure 31.14. Two long-thin-strip Ising models. A line between two spins
indicates that they are neighbours. The strips have width W and infinite
length.

Figure 31.15. Free energy per spin of long-thin-strip Ising models. Note the
non-zero gradient at T = 0 in the case of the triangular antiferromagnet.

Figure 31.16. Entropies (in nats) of width 8 Ising systems as a function of
temperature, obtained by differentiating the free energy curves in
figure 31.15. The rectangular ferromagnet and antiferromagnet have identical
thermal properties. For the triangular systems, the upper curve (−) denotes
the antiferromagnet and the lower curve (+) the ferromagnet.

The entropy of the triangular antiferromagnet at absolute zero appears to
be about 0.3, that is, about half its high temperature value (figure 31.16).
The mean energy as a function of temperature is plotted in figure 31.17. It is
evaluated using the identity ⟨E⟩ = −∂ ln Z/∂β.

Figure 31.18 shows the estimated heat capacity (taking raw derivatives of 
the mean energy) as a function of temperature for the triangular models with 
widths 4 and 8. Figure 31.19 shows the fluctuations in energy as a function of 
temperature. All of these figures should show smooth graphs; the roughness of 
the curves is due to inaccurate numerics. The nature of any phase transition 
is not obvious, but the graphs seem compatible with the assertion that the
ferromagnet shows, and the antiferromagnet does not show, a phase transition.

The pictures of the free energy in figure 31.15 give some insight into how 
we could predict the transition temperature. We can see how the two phases 
of the ferromagnetic systems each have simple free energies: a straight sloping 
line through F = 0, T = 0 for the high temperature phase, and a horizontal 
line for the low temperature phase. (The slope of each line shows what the 
entropy per spin of that phase is.) The phase transition occurs roughly at 
the intersection of these lines. So we predict the transition temperature to be 
linearly related to the ground state energy. 
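As a worked instance of this estimate, take the rectangular ferromagnet with J = 1 and k_B = 1, using the ground-state energy per spin of −2 and the high-temperature entropy per spin of ln 2 quoted earlier. The symbols f_high, f_low and T* are introduced here only for illustration, and the comparison value in the last line is a standard result for the infinite square lattice that is not derived in this chapter.

```latex
f_{\text{high}}(T) \simeq -T \ln 2, \qquad f_{\text{low}}(T) \simeq -2
\quad\Longrightarrow\quad
T^{*} \simeq \frac{2}{\ln 2} \approx 2.89,
\qquad\text{compared with the exact}\quad
T_c = \frac{2}{\ln(1+\sqrt{2})} \approx 2.27 .
```

The crossing point overestimates the true transition temperature, which is consistent with the text's claim that the transition occurs only roughly at the intersection; the linear dependence on the ground-state energy is visible in T* = −E_0 / ln 2.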
Figure 31.17. Mean energy versus temperature of long thin strip Ising models
with width 8. Compare with figure 31.3.

Figure 31.18. Heat capacities of (a) rectangular model; (b) triangular models
with different widths, (+) and (−) denoting ferromagnet and antiferromagnet.
Compare with figure 31.11.

Figure 31.19. Energy variances, per spin, of (a) rectangular model; (b)
triangular models with different widths, (+) and (−) denoting ferromagnet and
antiferromagnet. Compare with figure 31.11.
Comparison with the Monte Carlo results


The agreement between the results of the two experiments seems very good.
The two systems simulated (the long thin strip and the periodic square) are
not quite identical. One could make a more accurate comparison by finding all
eigenvalues for the strip of width W and computing \sum_a \mu_a^W to get the
partition function of a W × W patch.


> 31.3 Exercises 


> Exercise 31.2.[4] What would be the best way to extract the entropy from the
Monte Carlo simulations? What would be the best way to obtain the 
entropy and the heat capacity from the partition function computation? 


Exercise 31.3.[3] An Ising model may be generalized to have a coupling J_mn
between any spins m and n, and the value of J_mn could be different for each
m and n. In the special case where all the couplings are positive we know
that the system has two ground states, the all-up and all-down states. For a
more general setting of J_mn it is conceivable that there could be many ground
states.

Imagine that it is required to make a spin system whose local minima are
a given list of states x^(1), x^(2), ..., x^(S). Can you think of a way of
setting J such that the chosen states are low energy states? You are allowed to
adjust all the {J_mn} to whatever values you wish.
Exact Monte Carlo Sampling 


> 32.1 The problem with Monte Carlo methods 


For high-dimensional problems, the most widely used random sampling meth- 
ods are Markov chain Monte Carlo methods like the Metropolis method, Gibbs 
sampling, and slice sampling. 

The problem with all these methods is this: yes, a given algorithm can be 
guaranteed to produce samples from the target density P(x) asymptotically, 
‘once the chain has converged to the equilibrium distribution’. But if one runs 
the chain for too short a time T, then the samples will come from some other 
distribution $P^{(T)}(x)$. For how long must the Markov chain be run before it has 
‘converged’? As was mentioned in Chapter 29, this question is usually very 
hard to answer. However, the pioneering work of Propp and Wilson (1996) 
allows one, for certain chains, to answer this very question; furthermore Propp 
and Wilson show how to obtain ‘exact’ samples from the target density. 


> 32.2 Exact sampling concepts 


Propp and Wilson’s exact sampling method (also known as ‘perfect simulation’ 
or ‘coupling from the past’) depends on three ideas. 


Coalescence of coupled Markov chains 


First, if several Markov chains starting from different initial conditions share 
a single random-number generator, then their trajectories in state space may 
coalesce; and, having coalesced, will not separate again. If all initial conditions 
lead to trajectories that coalesce into a single trajectory, then we can be 
sure that the Markov chain has ‘forgotten’ its initial condition. Figure 32.1a-i 
shows twenty-one Markov chains identical to the one described in section 29.4, 
which samples from {0,1,..., 20} using the Metropolis algorithm (figure 29.12, 
p.368); each of the chains has a different initial condition but they are all driven 
by a single random number generator; the chains coalesce after about 80 steps. 
Figure 32.1a-ii shows the same Markov chains with a different random number 
seed; in this case, coalescence does not occur until 400 steps have elapsed (not 
shown). Figure 32.1b shows similar Markov chains, each of which has identical 
proposal density to those in section 29.4 and figure 32.1a; but in figure 32.1b, 
the proposed move at each step, ‘left’ or ‘right’, is obtained in the same way by 
all the chains at any timestep, independent of the current state. This coupling 
of the chains changes the statistics of coalescence. Because two neighbouring 
paths merge only when a rejection occurs, and rejections occur only at the 
walls (for this particular Markov chain), coalescence will occur only when the 
chains are all in the leftmost state or all in the rightmost state. 
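
The coalescence behaviour of the shared-proposal chains can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target is taken to be uniform over the 21 states, so a proposal is rejected only when it would leave the range 0..20, and the names `shared_move_step` and `forward_coalescence` are hypothetical helpers, not part of the original text.

```python
import random

N_STATES = 21  # states 0..20, as in the example of section 29.4


def shared_move_step(x, move):
    """Metropolis step for a uniform target: reject only at the walls."""
    y = x + move
    return y if 0 <= y < N_STATES else x


def forward_coalescence(seed, max_steps=100_000):
    """Run all 21 chains forward with one shared 'left'/'right' proposal per step.

    Returns (number of steps, common state), or (None, None) if the chains
    have not coalesced within max_steps.
    """
    rng = random.Random(seed)
    states = list(range(N_STATES))
    for t in range(1, max_steps + 1):
        move = rng.choice((-1, +1))  # the same proposal for every chain
        states = [shared_move_step(x, move) for x in states]
        if len(set(states)) == 1:
            return t, states[0]
    return None, None


if __name__ == "__main__":
    for seed in range(5):
        t, x = forward_coalescence(seed)
        print(f"seed {seed}: coalesced after {t} steps in state {x}")
```

In every run the common state at the moment of coalescence is 0 or 20, which is the wall effect described above.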
Figure 32.1. Coalescence, the first 
idea behind the exact sampling 
method. Time runs from bottom 
to top. In the leftmost panel, 
coalescence occurred within 100 
steps. Different coalescence 
properties are obtained depending 
on the way each state uses the 
random numbers it is supplied 
with. (a) Two runs of a 
Metropolis simulator in which the 
random bits that determine the 
proposed step depend on the 
current state; a different random 
number seed was used in each 
case. (b) In this simulator the 
random proposal (‘left’ or ‘right’) 
is the same for all states. In each 
panel, one of the paths, the one 
starting at location x = 8, has 
been highlighted. 
Coupling from the past 


How can we use the coalescence property to find an exact sample from the 
equilibrium distribution of the chain? The state of the system at the moment 
when complete coalescence occurs is not a valid sample from the equilibrium 
distribution; for example in figure 32.1b, final coalescence always occurs when 
the state is against one of the two walls, because trajectories merge only at 
the walls. So sampling forward in time until coalescence occurs is not a valid 
method. 

The second key idea of exact sampling is that we can obtain exact samples 
by sampling from a time $T_0$ in the past, up to the present. If coalescence 
has occurred, the present sample is an unbiased sample from the equilibrium 
distribution; if not, we restart the simulation from a time $T_0$ further into 
the past, reusing the same random numbers. The simulation is repeated at a 
sequence of ever more distant times $T_0$, with a doubling of $T_0$ from one run to 
the next being a convenient choice. When coalescence occurs at a time before 
‘the present’, we can record x(0) as an exact sample from the equilibrium 
distribution of the Markov chain. 

Figure 32.2 shows two exact samples produced in this way. In the leftmost 
panel of figure 32.2a, we start twenty-one chains in all possible initial conditions 
at $T_0 = -50$ and run them forward in time. Coalescence does not occur. 
We restart the simulation from all possible initial conditions at $T_0 = -100$, 
and reset the random number generator in such a way that the random numbers 
generated at each time t (in particular, from t = −50 to t = 0) will be 
identical to what they were in the first run. Notice that the trajectories produced 
from t = −50 to t = 0 by these runs that started from $T_0 = -100$ are 
identical to a subset of the trajectories in the first simulation with $T_0 = -50$. 
Coalescence still does not occur, so we double $T_0$ again to $T_0 = -200$. This 
time, all the trajectories coalesce and we obtain an exact sample, shown by 
the arrow. If we pick an earlier time such as $T_0 = -500$, all the trajectories 
must still end in the same point at t = 0, since every trajectory must pass 
through some state at t = −200, and all those states lead to the same final 
point. So if we ran the Markov chain for an infinite time in the past, from any 
initial condition, it would end in the same state. Figure 32.2b shows an exact 
sample produced in the same way with the Markov chains of figure 32.1b. 

This method, called coupling from the past, is important because it allows 
us to obtain exact samples from the equilibrium distribution; but, as described 
here, it is of little practical use, since we are obliged to simulate chains starting 
in all initial states. In the examples shown, there are only twenty-one states, 
but in any realistic sampling problem there will be an utterly enormous number 
of states; think of the $2^{1000}$ states of a system of 1000 binary spins, for 
example. The whole point of introducing Monte Carlo methods was to try to 
avoid having to visit all the states of such a system! 
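
The procedure just described can be written as a minimal Python sketch. It assumes the 21-state chain with shared ‘left’/‘right’ proposals and a uniform target; the function name `cftp_all` and the starting value of 50 steps are illustrative choices. The essential point is that the list `moves` is only ever extended further into the past, so the random numbers used for times closer to the present are reused unchanged when $T_0$ is doubled.

```python
import random


def step(x, move, n_states):
    """Metropolis step for a uniform target: reject only at the walls."""
    y = x + move
    return y if 0 <= y < n_states else x


def cftp_all(seed, n_states=21, t0=50):
    """Coupling from the past, simulating chains from every initial state.

    moves[k] is the proposal used for the step from time -(k+1) to time -k.
    Returns (state at time 0, number of past steps that gave coalescence).
    """
    rng = random.Random(seed)
    moves = []
    while True:
        while len(moves) < t0:  # add new randomness for earlier times only
            moves.append(rng.choice((-1, +1)))
        states = list(range(n_states))
        for k in range(t0 - 1, -1, -1):  # run from time -t0 up to time 0
            states = [step(x, moves[k], n_states) for x in states]
        if len(set(states)) == 1:
            return states[0], t0
        t0 *= 2  # no coalescence: start twice as far back


if __name__ == "__main__":
    print([cftp_all(seed)[0] for seed in range(20)])
```

Unlike the state at the moment of forward coalescence, the values returned here are not confined to the two walls.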


Monotonicity 


Having established that we can obtain valid samples by simulating forward 
from times in the past, starting in all possible states at those times, the third 
trick of Propp and Wilson, which makes the exact sampling method useful 
in practice, is the idea that, for some Markov chains, it may be possible to 
detect coalescence of all trajectories without simulating all those trajectories. 
This property holds, for example, in the chain of figure 32.1b, which has the 
property that two trajectories never cross. So if we simply track the two tra- 
jectories starting from the leftmost and rightmost states, we will know that 
Figure 32.2. ‘Coupling from the past’, the second idea behind the exact sampling method. 

Figure 32.3. (a) Ordering of states, the third idea behind the exact sampling method. The trajectories 
shown here are the left-most and right-most trajectories of figure 32.2b. In order to establish 
what the state at time zero is, we only need to run simulations from $T_0 = -50$, $T_0 = -100$, 
and $T_0 = -200$, after which point coalescence occurs. 
(b,c) Two more exact samples from the target density, generated by this method, and 
different random number seeds. The initial times required were $T_0 = -50$ and $T_0 = -1000$, 
respectively. 
coalescence of all trajectories has occurred when those two trajectories coalesce. 
Figure 32.3a illustrates this idea by showing only the left-most and 
right-most trajectories of figure 32.2b. Figure 32.3(b,c) shows two more exact 
samples from the same equilibrium distribution generated by running the 
‘coupling from the past’ method starting from the two end-states alone. In 
(b), two runs coalesced starting from $T_0 = -50$; in (c), it was necessary to try 
times up to $T_0 = -1000$ to achieve coalescence. 
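
A small Python sketch can check the claim that tracking the two end-states is enough. It assumes the same 21-state chain with shared proposals and a uniform target; `cftp` and its `starts` argument are hypothetical names introduced for illustration. Because the shared-proposal update never lets two trajectories cross, running from the states 0 and 20 alone should give exactly the same sample, at the same starting time, as running from all 21 states.

```python
import random


def step(x, move, n_states):
    """Order-preserving Metropolis step: reject only at the walls."""
    y = x + move
    return y if 0 <= y < n_states else x


def cftp(seed, starts, n_states=21, t0=50):
    """Coupling from the past, tracking only the initial states in `starts`.

    Returns (state at time 0, number of past steps that gave coalescence).
    """
    rng = random.Random(seed)
    moves = []  # moves[k] drives the step from time -(k+1) to time -k
    while True:
        while len(moves) < t0:
            moves.append(rng.choice((-1, +1)))
        states = list(starts)
        for k in range(t0 - 1, -1, -1):
            states = [step(x, moves[k], n_states) for x in states]
        if len(set(states)) == 1:
            return states[0], t0
        t0 *= 2


if __name__ == "__main__":
    for seed in range(5):
        full = cftp(seed, range(21))   # all twenty-one trajectories
        ends = cftp(seed, (0, 20))     # leftmost and rightmost only
        print(seed, full, ends, full == ends)
```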


32.3 Exact sampling from interesting distributions 


In the toy problem we studied, the states could be put in a one-dimensional 
order such that no two trajectories crossed. The states of many interesting 
state spaces can also be put into a partial order and coupled Markov chains 
can be found that respect this partial order. [An example of a partial order 
on the four possible states of two spins is this: (+,+) > (+,−) > (−,−); 
and (+,+) > (−,+) > (−,−); and the states (+,−) and (−,+) are not 
ordered.] For such systems, we can show that coalescence has occurred merely 
by verifying that coalescence has occurred for all the histories whose initial 
states were ‘maximal’ and ‘minimal’ states of the state space. 








As an example, consider the Gibbs sampling method applied to a ferro- 
magnetic Ising spin system, with the partial ordering of states being defined 
thus: state x is ‘greater than or equal to’ state y if $x_i \ge y_i$ for all spins i. 
The maximal and minimal states are the all-up and all-down states. The 
Markov chains are coupled together as shown in algorithm 32.4. Propp and 
Wilson (1996) show that exact samples can be generated for this system, al- 
though the time to find exact samples is large if the Ising model is below its 
critical temperature, since the Gibbs sampling method itself is slowly-mixing 
under these conditions. Propp and Wilson have improved on this method for 
the Ising model by using a Markov chain called the single-bond heat bath 
algorithm to sample from a related model called the random cluster model; 
they show that exact samples from the random cluster model can be obtained 
rapidly and can be converted into exact samples from the Ising model. Their 
ground-breaking paper includes an exact sample from a 16-million-spin Ising 
model at its critical temperature. A sample for a smaller Ising model is shown 
in figure 32.5. 
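
As a rough illustration of the coupled Gibbs sampler of algorithm 32.4 combined with coupling from the past, here is a Python sketch for a small ferromagnet. Several things are assumptions made for the example: the explicit inverse temperature `beta` (algorithm 32.4 can be read as absorbing it into the couplings), the random choice of which spin to update, the ring-shaped coupling matrix, and all function names. Only the all-up and all-down chains are simulated, which is valid here because all couplings are positive.

```python
import math
import random


def gibbs_update(x, i, u, J, beta):
    """Coupled Gibbs update of spin i, driven by the shared uniform number u."""
    a = beta * sum(J[i][j] * x[j] for j in range(len(x)))
    x[i] = +1 if u < 1.0 / (1.0 + math.exp(-2.0 * a)) else -1


def exact_ising_sample(J, beta, seed, t0=16):
    """Coupling from the past using the maximal and minimal states only."""
    rng = random.Random(seed)
    n = len(J)
    draws = []  # draws[k] = (spin, u) for the step from time -(k+1) to -k
    while True:
        while len(draws) < t0:
            draws.append((rng.randrange(n), rng.random()))
        top, bottom = [+1] * n, [-1] * n
        for k in range(t0 - 1, -1, -1):
            i, u = draws[k]
            gibbs_update(top, i, u, J, beta)
            gibbs_update(bottom, i, u, J, beta)
        if top == bottom:
            return top
        t0 *= 2


def ring_couplings(n, strength=1.0):
    """Symmetric nearest-neighbour couplings on a ring (illustrative choice)."""
    J = [[0.0] * n for _ in range(n)]
    for i in range(n):
        J[i][(i + 1) % n] = J[(i + 1) % n][i] = strength
    return J


if __name__ == "__main__":
    print(exact_ising_sample(ring_couplings(6), beta=0.3, seed=0))
```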


A generalization of the exact sampling method for ‘non-attractive’ distributions 


The method of Propp and Wilson for the Ising model, sketched above, can 
be applied only to probability distributions that are, as they call them, ‘at- 
tractive’. Rather than define this term, let’s say what it means, for practical 
purposes: the method can be applied to spin systems in which all the cou- 
plings are positive (e.g., the ferromagnet), and to a few special spin systems 
with negative couplings (e.g., as we already observed in Chapter 31, the rect- 
angular ferromagnet and antiferromagnet are equivalent); but it cannot be 
applied to general spin systems in which some couplings are negative, because 
in such systems the trajectories followed by the all-up and all-down states 
are not guaranteed to be upper and lower bounds for the set of all trajecto- 
ries. Fortunately, however, we do not need to be so strict. It is possible to 
re-express the Propp and Wilson algorithm in a way that generalizes to the 
case of spin systems with negative couplings. The idea of the summary state 
version of exact sampling is still that we keep track of bounds on the set of 





Compute $a_i := \sum_j J_{ij} x_j$ 
Draw u from Uniform(0, 1) 
If $u < 1/(1 + e^{-2 a_i})$ 
    $x_i := +1$ 
Else 
    $x_i := -1$ 





Algorithm 32.4. Gibbs sampling 
coupling method. The Markov 
chains are coupled together by 
having all chains update the same 
spin i at each time step and 
having all chains share a common 
sequence of random numbers u. 





Figure 32.5. An exact sample from 
the Ising model at its critical 
temperature, produced by 

D.B. Wilson. Such samples can be 
produced within seconds on an 
ordinary computer by exact 
sampling. 
all trajectories, and detect when these bounds are equal, so as to find exact 
samples. But the bounds will not themselves be actual trajectories, and they 
will not necessarily be tight bounds. 

Instead of simulating two trajectories, each of which moves in a state space 
$\{-1,+1\}^N$, we simulate one trajectory envelope in an augmented state space 
$\{-1,+1,?\}^N$, where the symbol ? denotes ‘either −1 or +1’. We call the state 
of this augmented system the ‘summary state’. An example summary state of 





a six-spin system is ++-?+?. This summary state is shorthand for the set of 
states 
++-+++, ++-++-, ++--++, ++--+-. 

The update rule at each step of the Markov chain takes a single spin, enu- 
merates all possible states of the neighbouring spins that are compatible with 
the current summary state, and, for each of these local scenarios, computes 
the new value (+ or -) of the spin using Gibbs sampling (coupled to a random 
number u as in algorithm 32.4). If all these new values agree, then the new 
value of the updated spin in the summary state is set to the unanimous value 
(+ or -). Otherwise, the new value of the spin in the summary state is ‘?’. The 
initial condition, at time $T_0$, is given by setting all the spins in the summary 
state to ‘?’, which corresponds to considering all possible start configurations. 

In the case of a spin system with positive couplings, this summary state 
simulation will be identical to the simulation of the uppermost and lowermost 
states, in the style of Propp and Wilson, with coalescence occurring 
when all the ‘?’ symbols have disappeared. The summary state method can 
be applied to general spin systems with any couplings. The only shortcoming 
of this method is that the envelope may describe an unnecessarily large set of 
states, so there is no guarantee that the summary state algorithm will con- 
verge; the time for coalescence to be detected may be considerably larger than 
the actual time taken for the underlying Markov chain to coalesce. 
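
The summary-state update rule can be sketched as follows. This is an illustrative Python fragment, not the published implementation: the symbol ‘?’ is represented by `None`, the inverse temperature `beta` is an assumed explicit parameter, and `summary_update` is a hypothetical name. The function enumerates every setting of the unknown neighbours, applies the coupled Gibbs rule with the shared random number u, and writes a definite value only if all scenarios agree.

```python
import itertools
import math


def summary_update(s, i, u, J, beta=1.0):
    """Update spin i of summary state s (entries +1, -1, or None for '?')."""
    nbrs = [j for j in range(len(s)) if j != i and J[i][j] != 0]
    unknown = [j for j in nbrs if s[j] is None]
    outcomes = set()
    # Enumerate all local scenarios compatible with the summary state.
    for fill in itertools.product((-1, +1), repeat=len(unknown)):
        local = {j: s[j] for j in nbrs}
        local.update(zip(unknown, fill))
        a = beta * sum(J[i][j] * local[j] for j in nbrs)
        outcomes.add(+1 if u < 1.0 / (1.0 + math.exp(-2.0 * a)) else -1)
    # Unanimous result: definite value. Otherwise the spin becomes '?'.
    s[i] = outcomes.pop() if len(outcomes) == 1 else None


if __name__ == "__main__":
    # Three-spin antiferromagnetic chain (negative couplings).
    J = [[0, -1, 0], [-1, 0, -1], [0, -1, 0]]
    for u in (0.01, 0.5, 0.99):
        s = [None, None, None]
        summary_update(s, 1, u, J)
        print(u, s)
```

With both neighbours unknown, an extreme value of u still fixes the middle spin, whereas an intermediate value leaves it as ‘?’; this is how ‘?’ symbols can disappear even when the couplings are negative.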

The summary state scheme has been applied to exact sampling in belief 
networks by Harvey and Neal (2000), and to the triangular antiferromagnetic 
Ising model by Childs et al. (2001). Summary state methods were first intro- 
duced by Huber (1998); they also go by the names sandwiching methods and 
bounding chains. 


Further reading 


For further reading, impressive pictures of exact samples from other distributions, 
and generalizations of the exact sampling method, browse the perfectly-random 
sampling website.¹ 

For beautiful exact-sampling demonstrations running live in your web-browser, 
see Jim Propp’s website.² 


Other uses for coupling 


The idea of coupling together Markov chains by having them share a random 
number generator has other applications beyond exact sampling. Pinto and 
Neal (2001) have shown that the accuracy of estimates obtained from a Markov 
chain Monte Carlo simulation (the second problem discussed in section 29.1, 
p.357), using the estimator 


$$\hat{\Phi}_P \equiv \frac{1}{T} \sum_t \phi(x^{(t)}), \tag{32.1}$$


¹ http://www.dbwilson.com/exact/ 
² http://www.math.wisc.edu/~propp/tiling/www/applets/ 
Figure 32.6. A perfectly random 
tiling of a hexagon by lozenges, 
provided by J.G. Propp and 
D.B. Wilson. 





can be improved by coupling the chain of interest, which converges to P, to a 
second chain, which generates samples from a second, simpler distribution, Q. 
The coupling must be set up in such a way that the states of the two chains 
are strongly correlated. The idea is that we first estimate the expectations of 
a function of interest, $\phi$, under P and under Q in the normal way (32.1) and 
compare the estimate under Q, $\hat{\Phi}_Q$, with the true value of the expectation 
under Q, $\Phi_Q$, which we assume can be evaluated exactly. If $\hat{\Phi}_Q$ is an 
overestimate then it is likely that $\hat{\Phi}_P$ will be an overestimate too. The difference 
$(\hat{\Phi}_Q - \Phi_Q)$ can thus be used to correct $\hat{\Phi}_P$. 


> 32.4 Exercises 


> Exercise 32.1.1% P421] Is there any relationship between the probability dis- 
tribution of the time taken for all trajectories to coalesce, and the equi- 
libration time of a Markov chain? Prove that there is a relationship, or 
find a single chain that can be realized in two different ways that have 
different coalescence times. 


> Exercise 32.2.!?1 Imagine that Fred ignores the requirement that the random 
bits used at some time t, in every run from increasingly distant times 
$T_0$, must be identical, and makes a coupled-Markov-chain simulator that 
uses fresh random numbers every time $T_0$ is changed. Describe what 
happens if Fred applies his method to the Markov chain that is intended 
to sample from the uniform distribution over the states 0, 1, and 2, using 
the Metropolis method, driven by a random bit source as in figure 32.1b. 


Exercise 32.3.15] Investigate the application of perfect sampling to linear re- 
gression in Holmes and Mallick (1998) or Holmes and Denison (2002) 
and try to generalize it. 


Exercise 32.4.19] The concept of coalescence has many applications. Some sur- 
names are more frequent than others, and some die out altogether. Make 
a model of this process; how long will it take until everyone has the same 
surname? 


Similarly, variability in any particular portion of the human genome 
(which forms the basis of forensic DNA fingerprinting) is inherited like a 
surname. A DNA fingerprint is like a string of surnames. Should the fact 
that these surnames are subject to coalescences, so that some surnames 
are by chance more prevalent than others, affect the way in which DNA 
fingerprint evidence is used in court? 


> Exercise 32.5.!?] How can you use a coin to create a random ranking of 3 
people? Construct a solution that uses exact sampling. For example, 
you could apply exact sampling to a Markov chain in which the coin is 
repeatedly used alternately to decide whether to switch first and second, 
then whether to switch second and third. 


Exercise 32.6.1] Finding the partition function Z of a probability distribution 
is a difficult problem. Many Markov chain Monte Carlo methods produce 
valid samples from a distribution without ever finding out what Z is. 


Is there any probability distribution and Markov chain such that either 
the time taken to produce a perfect sample or the number of random bits 
used to create a perfect sample are related to the value of Z? Are there 
some situations in which the time to coalescence conveys information 
about Z? 


> 32.5 Solutions 


Solution to exercise 32.1 (p.420). It is perhaps surprising that there is no di- 
rect relationship between the equilibration time and the time to coalescence. 
We can prove this using the example of the uniform distribution over the integers 
$\mathcal{A} = \{0,1,2,\ldots,20\}$. A Markov chain that converges to this distribution 
in exactly one iteration is the chain for which the probability of state $s_{t+1}$ 
given $s_t$ is the uniform distribution, for all $s_t$. Such a chain can be coupled 
to a random number generator in two ways: (a) we could draw a random 
integer $u \in \mathcal{A}$, and set $s_{t+1}$ equal to u regardless of $s_t$; or (b) we could draw 
a random integer $u \in \mathcal{A}$, and set $s_{t+1}$ equal to $(s_t + u) \bmod 21$. Method (b) 
would produce a cohort of trajectories locked together, similar to the trajec- 
tories in figure 32.1, except that no coalescence ever occurs. Thus, while the 
equilibration times of methods (a) and (b) are both one, the coalescence times 
are respectively one and infinity. 
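
The two couplings in this solution are simple enough to check directly. The following Python sketch (with hypothetical helper names) runs all 21 trajectories under each coupling and counts how many distinct states remain.

```python
import random

N = 21  # the alphabet {0, 1, ..., 20}


def step_a(s, u):
    """Method (a): jump to u regardless of the current state."""
    return u


def step_b(s, u):
    """Method (b): shift the current state by u, modulo 21."""
    return (s + u) % N


def distinct_states_after(step, n_steps, seed):
    """Number of distinct trajectories left after n_steps coupled updates."""
    rng = random.Random(seed)
    states = list(range(N))
    for _ in range(n_steps):
        u = rng.randrange(N)  # one shared random integer per time step
        states = [step(s, u) for s in states]
    return len(set(states))


if __name__ == "__main__":
    print(distinct_states_after(step_a, 1, seed=0))     # coalesced at once
    print(distinct_states_after(step_b, 1000, seed=0))  # still all distinct
```

Under method (b) every update is a permutation of the states, so the trajectories remain distinct however long the simulation runs.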

It seems plausible on the other hand that coalescence time provides some 
sort of upper bound on equilibration time. 
Variational Methods 


Variational methods are an important technique for the approximation of com- 
plicated probability distributions, having applications in statistical physics, 
data modelling and neural networks. 


> 33.1 Variational free energy minimization 


One method for approximating a complex distribution in a physical system is 
mean field theory. Mean field theory is a special case of a general variational 
free energy approach of Feynman and Bogoliubov which we will now study. 
The key piece of mathematics needed to understand this method is Gibbs’ 
inequality, which we repeat here. [Gibbs’ inequality first appeared in 
equation (1.24); see also exercise 2.26 (p.37).] 

The relative entropy between two probability distributions Q(x) and P(x) 
that are defined over the same alphabet $\mathcal{A}_X$ is 

$$D_{\mathrm{KL}}(Q\|P) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}. \tag{33.1}$$

The relative entropy satisfies $D_{\mathrm{KL}}(Q\|P) \ge 0$ (Gibbs’ inequality) with 
equality only if Q = P. In general $D_{\mathrm{KL}}(Q\|P) \ne D_{\mathrm{KL}}(P\|Q)$. 
In this chapter we will replace the log by In, and measure the divergence 
in nats. 
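
As a quick numerical illustration of these properties, the following Python sketch evaluates the relative entropy in nats for two distributions over a binary alphabet. The function name `kl_nats` and the example probabilities are arbitrary choices.

```python
import math


def kl_nats(q, p):
    """Relative entropy D_KL(Q||P) in nats; terms with Q(x) = 0 contribute 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


if __name__ == "__main__":
    q = [0.5, 0.5]
    p = [0.9, 0.1]
    print(kl_nats(q, q))  # zero when the two distributions are equal
    print(kl_nats(q, p))  # positive
    print(kl_nats(p, q))  # positive, but a different value: not symmetric
```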

Probability distributions in statistical physics 

In statistical physics one often encounters probability distributions of the form 


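$$P(x \mid \beta, J) = \frac{1}{Z(\beta,J)} \exp[-\beta E(x;J)], \tag{33.2}$$

where for example the state vector is $x \in \{-1,+1\}^N$, and E(x;J) is some 
energy function such as 

$$E(x;J) = -\frac{1}{2} \sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n. \tag{33.3}$$

The partition function (normalizing constant) is 

$$Z(\beta,J) \equiv \sum_x \exp[-\beta E(x;J)]. \tag{33.4}$$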
The probability distribution of equation (33.2) is complex. Not unbearably 
complex — we can, after all, evaluate E(x; J) for any particular x in a time 
polynomial in the number of spins. But evaluating the normalizing constant 
$Z(\beta,J)$ is difficult, as we saw in Chapter 29, and describing the properties of 
the probability distribution is also hard. Knowing the value of E(x;J) at a 
few arbitrary points x, for example, gives no useful information about what 
the average properties of the system are. 

An evaluation of $Z(\beta,J)$ would be particularly desirable because from Z 
we can derive all the thermodynamic properties of the system. 
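
To make the difficulty concrete, here is a brute-force Python sketch that evaluates the partition function of equation (33.4) by enumerating every state, using the energy function of equation (33.3). The function names are hypothetical. Each energy evaluation is cheap, but the sum has $2^N$ terms, so this approach is only feasible for a handful of spins.

```python
import itertools
import math


def energy(x, J, h):
    """E(x; J) = -1/2 sum_{m,n} J_mn x_m x_n - sum_n h_n x_n."""
    n = len(x)
    pair = sum(J[m][k] * x[m] * x[k] for m in range(n) for k in range(n))
    return -0.5 * pair - sum(h[k] * x[k] for k in range(n))


def partition_function(J, h, beta):
    """Sum exp(-beta E) over all 2^N spin states (exponential cost in N)."""
    n = len(h)
    return sum(math.exp(-beta * energy(x, J, h))
               for x in itertools.product((-1, +1), repeat=n))


if __name__ == "__main__":
    # Two coupled spins with no applied field: E(x) = -x1 x2.
    J = [[0, 1], [1, 0]]
    print(partition_function(J, [0, 0], beta=1.0), 4 * math.cosh(1.0))
```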

Variational free energy minimization is a method for approximating the 
complex distribution P(x) by a simpler ensemble $Q(x;\theta)$ that is parameterized 
by adjustable parameters $\theta$. We adjust these parameters so as to get Q to 
best approximate P, in some sense. A by-product of this approximation is a 
lower bound on $Z(\beta,J)$. 


The variational free energy 


The objective function chosen to measure the quality of the approximation is 
the variational free energy 


$$\beta \tilde{F}(\theta) = \sum_x Q(x;\theta) \ln \frac{Q(x;\theta)}{\exp[-\beta E(x;J)]}. \tag{33.5}$$


This expression can be manipulated into a couple of interesting forms: first, 


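$$\beta \tilde{F}(\theta) = \beta \sum_x Q(x;\theta) E(x;J) - \sum_x Q(x;\theta) \ln \frac{1}{Q(x;\theta)} \tag{33.6}$$

$$\equiv \beta \langle E(x;J) \rangle_Q - S_Q, \tag{33.7}$$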


where $\langle E(x;J) \rangle_Q$ is the average of the energy function under the distribution 
$Q(x;\theta)$, and $S_Q$ is the entropy of the distribution $Q(x;\theta)$ (we set $k_B$ to one 
in the definition of S so that it is identical to the definition of the entropy H 
in Part I). 

Second, we can use the definition of $P(x \mid \beta, J)$ to write: 

$$\beta \tilde{F}(\theta) = \sum_x Q(x;\theta) \ln \frac{Q(x;\theta)}{P(x \mid \beta, J)} - \ln Z(\beta,J) \tag{33.8}$$

$$= D_{\mathrm{KL}}(Q\|P) + \beta F, \tag{33.9}$$

where F is the true free energy, defined by 

$$\beta F \equiv -\ln Z(\beta,J), \tag{33.10}$$

and $D_{\mathrm{KL}}(Q\|P)$ is the relative entropy between the approximating distribution 
$Q(x;\theta)$ and the true distribution $P(x \mid \beta, J)$. Thus by Gibbs’ inequality, the 
variational free energy $\tilde{F}(\theta)$ is bounded below by F and attains this value only 
for $Q(x;\theta) = P(x \mid \beta, J)$. 

Our strategy is thus to vary $\theta$ in such a way that $\beta \tilde{F}(\theta)$ is minimized. 
The approximating distribution then gives a simplified approximation to the 
true distribution that may be useful, and the value of $\beta \tilde{F}(\theta)$ will be an upper 
bound for $\beta F$. Equivalently, $\tilde{Z} \equiv e^{-\beta \tilde{F}(\theta)}$ is a lower bound for Z. 
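
The decomposition (33.9) and the resulting bound can be verified numerically on a system small enough to enumerate. The Python sketch below does this for a separable Q on three spins; the couplings, fields, and probabilities `q_up` are arbitrary illustrative values, and the function names are hypothetical.

```python
import itertools
import math


def energy(x, J, h):
    """Spin-system energy of equation (33.3)."""
    n = len(x)
    pair = sum(J[m][k] * x[m] * x[k] for m in range(n) for k in range(n))
    return -0.5 * pair - sum(h[k] * x[k] for k in range(n))


def variational_terms(J, h, beta, q_up):
    """Return (beta*F_tilde, D_KL(Q||P), beta*F) for a separable Q.

    q_up[n] is the probability under Q that spin n is +1 (strictly in (0,1)).
    """
    n = len(h)
    states = list(itertools.product((-1, +1), repeat=n))
    boltz = {x: math.exp(-beta * energy(x, J, h)) for x in states}
    Z = sum(boltz.values())

    def Q(x):
        return math.prod(q_up[k] if x[k] == 1 else 1 - q_up[k] for k in range(n))

    bF_tilde = sum(Q(x) * math.log(Q(x) / boltz[x]) for x in states)
    kl = sum(Q(x) * math.log(Q(x) / (boltz[x] / Z)) for x in states)
    return bF_tilde, kl, -math.log(Z)


if __name__ == "__main__":
    J = [[0, 1, 0.5], [1, 0, -0.3], [0.5, -0.3, 0]]
    h = [0.1, 0.0, -0.2]
    bF_tilde, kl, bF = variational_terms(J, h, beta=0.7, q_up=[0.6, 0.7, 0.4])
    print(bF_tilde, kl + bF, bF)  # first two agree; both exceed the third
```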


Can the objective function $\beta \tilde{F}$ be evaluated? 


We have already agreed that the evaluation of various interesting sums over x 
is intractable. For example, the partition function 


$$Z = \sum_x \exp(-\beta E(x;J)), \tag{33.11}$$

the energy 

$$\langle E \rangle_P = \frac{1}{Z} \sum_x E(x;J) \exp(-\beta E(x;J)), \tag{33.12}$$

and the entropy 

$$S \equiv \sum_x P(x \mid \beta, J) \ln \frac{1}{P(x \mid \beta, J)} \tag{33.13}$$

are all presumed to be impossible to evaluate. So why should we suppose 
that this objective function $\beta \tilde{F}(\theta)$, which is also defined in terms of a sum 
over all x (33.5), should be a convenient quantity to deal with? Well, for a 
range of interesting energy functions, and for sufficiently simple approximating 
distributions, the variational free energy can be efficiently evaluated. 


> 33.2 Variational free energy minimization for spin systems 


An example of a tractable variational free energy is given by the spin system 
whose energy function was given in equation (33.3), which we can approximate 
with a separable approximating distribution, 


$$Q(x;a) = \frac{1}{Z_Q} \exp\Bigl( \sum_n a_n x_n \Bigr). \tag{33.14}$$

The variational parameters $\theta$ of the variational free energy (33.5) are the 
components of the vector a. To evaluate the variational free energy we need 
the entropy of this distribution, 

$$S_Q = \sum_x Q(x;a) \ln \frac{1}{Q(x;a)}, \tag{33.15}$$

and the mean of the energy, 

$$\langle E(x;J) \rangle_Q = \sum_x Q(x;a) E(x;J). \tag{33.16}$$


The entropy of the separable approximating distribution is simply the sum of 
the entropies of the individual spins (exercise 4.2, p.68), 


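$$S_Q = \sum_n H_2^{(e)}(q_n), \tag{33.17}$$

where $q_n$ is the probability that spin n is +1, 

$$q_n = \frac{e^{a_n}}{e^{a_n} + e^{-a_n}} = \frac{1}{1 + \exp(-2a_n)}, \tag{33.18}$$

and 

$$H_2^{(e)}(q) = q \ln \frac{1}{q} + (1-q) \ln \frac{1}{(1-q)}. \tag{33.19}$$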


The mean energy under Q is easy to obtain because $\sum_{m,n} J_{mn} x_m x_n$ is a sum 
of terms each involving the product of two independent random variables. 
(There are no self-couplings, so $J_{mn} = 0$ when m = n.) If we define the mean 
value of $x_n$ to be $\bar{x}_n$, which is given by 

$$\bar{x}_n = \frac{e^{a_n} - e^{-a_n}}{e^{a_n} + e^{-a_n}} = \tanh(a_n) = 2q_n - 1, \tag{33.20}$$




we obtain 


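$$\langle E(x;J) \rangle_Q = \sum_x Q(x;a) \Bigl[ -\frac{1}{2} \sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n \Bigr] \tag{33.21}$$

$$= -\frac{1}{2} \sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n. \tag{33.22}$$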


So the variational free energy is given by 


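$$\beta \tilde{F}(a) = \beta \langle E(x;J) \rangle_Q - S_Q = \beta \Bigl( -\frac{1}{2} \sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n \Bigr) - \sum_n H_2^{(e)}(q_n). \tag{33.23}$$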


We now consider minimizing this function with respect to the variational 





parameters a. If $q = 1/(1 + e^{-2a})$, the derivative of the entropy is 

$$\frac{\partial}{\partial q} H_2^{(e)}(q) = \ln \frac{1-q}{q} = -2a. \tag{33.24}$$


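So we obtain 

$$\frac{\partial}{\partial a_m} \beta \tilde{F}(a) = \beta \Bigl[ -\sum_n J_{mn} \bar{x}_n - h_m \Bigr] \Bigl( 2 \frac{\partial q_m}{\partial a_m} \Bigr) - \ln \Bigl( \frac{1-q_m}{q_m} \Bigr) \Bigl( \frac{\partial q_m}{\partial a_m} \Bigr)$$

$$= 2 \Bigl( \frac{\partial q_m}{\partial a_m} \Bigr) \Bigl[ -\beta \Bigl( \sum_n J_{mn} \bar{x}_n + h_m \Bigr) + a_m \Bigr]. \tag{33.25}$$

This derivative is equal to zero when 

$$a_m = \beta \Bigl( \sum_n J_{mn} \bar{x}_n + h_m \Bigr). \tag{33.26}$$

So $\tilde{F}(a)$ is extremized at any point that satisfies equation (33.26) and 

$$\bar{x}_n = \tanh(a_n). \tag{33.27}$$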

The variational free energy $\tilde{F}(a)$ may be a multimodal function, in which 
case each stationary point (maximum, minimum or saddle) will satisfy equations 
(33.26) and (33.27). One way of using these equations, in the case of a 
system with an arbitrary coupling matrix J, is to update each parameter $a_m$ 
and the corresponding value of $\bar{x}_m$ using equation (33.26), one at a time. This 
asynchronous updating of the parameters is guaranteed to decrease $\beta \tilde{F}(a)$. 
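
The asynchronous update scheme can be sketched in Python as follows. The sketch assumes a symmetric coupling matrix with zero diagonal, and the function names, the four-spin ring, and the parameter values are illustrative. After each single-parameter update it records $\beta \tilde{F}(a)$ from equation (33.23), so that one can confirm the objective never increases.

```python
import math


def binary_entropy(q):
    """H_2^(e)(q) in nats."""
    return -sum(p * math.log(p) for p in (q, 1.0 - q) if p > 0.0)


def beta_free_energy(a, J, h, beta):
    """beta * F_tilde(a) for the separable approximation, equation (33.23)."""
    n = len(a)
    xbar = [math.tanh(v) for v in a]
    e = -0.5 * sum(J[m][k] * xbar[m] * xbar[k] for m in range(n) for k in range(n))
    e -= sum(h[k] * xbar[k] for k in range(n))
    s = sum(binary_entropy(1.0 / (1.0 + math.exp(-2.0 * v))) for v in a)
    return beta * e - s


def mean_field(J, h, beta, a_init, sweeps=200):
    """Asynchronous updates of equation (33.26), one parameter at a time."""
    a = list(a_init)
    n = len(a)
    trace = [beta_free_energy(a, J, h, beta)]
    for _ in range(sweeps):
        for m in range(n):
            field = sum(J[m][k] * math.tanh(a[k]) for k in range(n)) + h[m]
            a[m] = beta * field
            trace.append(beta_free_energy(a, J, h, beta))
    return a, trace


if __name__ == "__main__":
    J = [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]]  # 4-spin ring
    a, trace = mean_field(J, [0.1] * 4, beta=0.8, a_init=[0.1, -0.2, 0.3, 0.05])
    print([round(math.tanh(v), 4) for v in a], trace[0], trace[-1])
```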

Equations (33.26) and (33.27) may be recognized as the mean field equations 
for a spin system. The variational parameter $a_n$ may be thought of as 
the strength of a fictitious field applied to an isolated spin n. Equation (33.27) 
describes the mean response of spin n, and equation (33.26) describes how the 
field $a_m$ is set in response to the mean state of all the other spins. 

The variational free energy derivation is a helpful viewpoint for mean field 
theory for two reasons. 


1. This approach associates an objective function $\beta \tilde{F}$ with the mean field 
equations; such an objective function is useful because it can help identify 
alternative dynamical systems that minimize the same function. 
Figure 33.1. The variational free 
energy of the two-spin system 
whose energy is $E(x) = -x_1 x_2$, as 
a function of the two variational 
parameters $q_1$ and $q_2$. The 
inverse-temperature is $\beta = 1.44$. 
The function plotted is 

$$\beta \tilde{F} = -\beta \bar{x}_1 \bar{x}_2 - H_2^{(e)}(q_1) - H_2^{(e)}(q_2),$$

where $\bar{x}_n = 2q_n - 1$. Notice that 
for fixed $q_2$ the function is 
convex ⌣ with respect to $q_1$, and 
for fixed $q_1$ it is convex ⌣ with 
respect to $q_2$. 
2. The theory is readily generalized to other approximating distributions. 
We can imagine introducing a more complex approximation $Q(x;\theta)$ that 
might for example capture correlations among the spins instead of modelling 
the spins as independent. One could then evaluate the variational 
free energy and optimize the parameters $\theta$ of this more complex approximation. 
The more degrees of freedom the approximating distribution 
has, the tighter the bound on the free energy becomes. However, if the 
complexity of an approximation is increased, the evaluation of either the 
mean energy or the entropy typically becomes more challenging. 


> 33.3 Example: mean field theory for the ferromagnetic Ising model 


In the simple Ising model studied in Chapter 31, every coupling Jmn is equal 
to J if m and n are neighbours and zero otherwise. There is an applied 
field hn = h that is the same for all spins. A very simple approximating 
distribution is one with just a single variational parameter a, which defines a 
separable distribution 


$$Q(x;a) = \frac{1}{Z_Q} \exp\Bigl( a \sum_n x_n \Bigr), \tag{33.28}$$

in which all spins are independent and have the same probability 

$$q_n = \frac{1}{1 + \exp(-2a)} \tag{33.29}$$

of being up. The mean magnetization is 

$$\bar{x} = \tanh(a) \tag{33.30}$$


and the equation (33.26) which defines the minimum of the variational free 
energy becomes 


$$a = \beta (C J \bar{x} + h), \tag{33.31}$$

where C is the number of couplings that a spin is involved in; C = 4 in the 
case of a rectangular two-dimensional Ising model. We can solve equations 
(33.30) and (33.31) for $\bar{x}$ numerically (in fact, it is easiest to vary $\bar{x}$ and solve 
for $\beta$) and obtain graphs of the free energy minima and maxima as a function 
of temperature as shown in figure 33.2. The solid line shows $\bar{x}$ versus $T = 1/\beta$ 
for the case C = 4, J = 1. 
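
The trick of varying $\bar{x}$ and solving for $\beta$ can be written in a couple of lines of Python. Combining (33.30) and (33.31) gives $\tanh^{-1}(\bar{x}) = \beta (CJ\bar{x} + h)$, which is linear in $\beta$. The function names below are hypothetical, and the sketch assumes $\bar{x}$ lies strictly between −1 and 1 with $CJ\bar{x} + h$ nonzero and of the same sign as $\bar{x}$.

```python
import math


def beta_for_magnetization(xbar, C=4, J=1.0, h=0.0):
    """Inverse temperature at which xbar solves the mean field equations."""
    return math.atanh(xbar) / (C * J * xbar + h)


def temperature_for_magnetization(xbar, C=4, J=1.0, h=0.0):
    """Temperature T = 1/beta for a given mean magnetization xbar."""
    return 1.0 / beta_for_magnetization(xbar, C, J, h)


if __name__ == "__main__":
    # Points on the h = 0 branch; T approaches C*J = 4 as xbar -> 0.
    for xbar in (0.999, 0.9, 0.5, 0.1, 1e-4):
        print(xbar, temperature_for_magnetization(xbar))
```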

When h = 0, there is a pitchfork bifurcation at a critical temperature $T_c^{\mathrm{mft}}$. 
[A pitchfork bifurcation is a transition like the one shown by the solid lines in 
Figure 33.2. Solutions of the 
variational free energy 
extremization problem for the 
Ising model, for three different 
applied fields h. Horizontal axis: 
temperature $T = 1/\beta$. Vertical 
axis: magnetization $\bar{x}$. The 
critical temperature found by 
mean field theory is $T_c^{\mathrm{mft}} = 4$. 
figure 33.2, from a system with one minimum as a function of a (on the right) 
to a system (on the left) with two minima and one maximum; the maximum 
is the middle one of the three lines. The solid lines look like a pitchfork.] 
Above this temperature, there is only one minimum in the variational free 
energy, at a = 0 and $\bar{x} = 0$; this minimum corresponds to an approximating 
distribution that is uniform over all states. Below the critical temperature, 
there are two minima corresponding to approximating distributions that are 
symmetry-broken, with all spins more likely to be up, or all spins more likely 
to be down. The state $\bar{x} = 0$ persists as a stationary point of the variational 
free energy, but now it is a local maximum of the variational free energy. 

When h > 0, there is a global variational free energy minimum at any 
temperature for a positive value of $\bar{x}$, shown by the upper dotted curves in 
figure 33.2. As long as h < JC, there is also a second local minimum in the 
free energy, if the temperature is sufficiently small. This second minimum corresponds 
to a self-preserving state of magnetization in the opposite direction 
to the applied field. The temperature at which the second minimum appears 
is smaller than $T_c^{\mathrm{mft}}$, and when it appears, it is accompanied by a saddle point 
located between the two minima. A name given to this type of bifurcation is 
a saddle-node bifurcation. 

The variational free energy per spin is given by 


\[ \beta F = \beta \left( -\frac{C}{2} J \bar{x}^2 - h \bar{x} \right) - H_2^{(e)}\!\left( \frac{\bar{x} + 1}{2} \right). \tag{33.32} \]
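The free energy per spin is cheap to evaluate numerically, which gives a quick check on the bifurcation picture described above. The sketch below assumes the form βF = β(−(C/2)Jx̄² − hx̄) − H_2((x̄ + 1)/2), with H_2 the binary entropy measured in nats; the function names and the crude grid search are illustrative choices rather than anything prescribed by the text.

```python
import math

def binary_entropy_nats(q):
    """Binary entropy in nats, with the convention 0 ln 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

def beta_free_energy_per_spin(xbar, T, C=4, J=1.0, h=0.0):
    """beta*F per spin for mean magnetization xbar at temperature T = 1/beta."""
    beta = 1.0 / T
    energy = -0.5 * C * J * xbar ** 2 - h * xbar
    return beta * energy - binary_entropy_nats(0.5 * (xbar + 1.0))

def grid_minimizer(T, h=0.0, n=2001):
    """Return the grid point in [-1, 1] with the smallest free energy."""
    grid = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda x: beta_free_energy_per_spin(x, T, h=h))

for T in (5.0, 3.0, 2.0):
    print(f"T = {T}: minimum near xbar = {grid_minimizer(T):+.3f}")
```

Above T = 4 the grid search returns x̄ = 0; below it, the search lands on one of the two symmetry-broken minima, and a small positive h selects the positive one.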


Exercise 33.1. Sketch the variational free energy as a function of its one
parameter x̄ for a variety of values of the temperature T and the applied
field h.


Figure 33.2 reproduces the key properties of the real Ising system — that, 
for h = 0, there is a critical temperature below which the system has long- 
range order, and that it can adopt one of two macroscopic states. However, 
by probing a little more we can reveal some inadequacies of the variational 
approximation. To start with, the critical temperature T_c^mft is 4, which is
nearly a factor of 2 greater than the true critical temperature T_c = 2.27. Also,
the variational model has equivalent properties in any number of dimensions,
including d = 1, where the true system does not have a phase transition. So
the bifurcation at T_c^mft should not be described as a phase transition.

For the case h = 0 we can follow the trajectory of the global minimum as
a function of β and find the entropy, heat capacity and fluctuations of the
approximating distribution and compare them with those of a real 8 × 8 fragment
using the matrix method of Chapter 31. As shown in figure 33.3, one of the
biggest differences is in the fluctuations in energy. The real system has large 
fluctuations near the critical temperature, whereas the approximating distri- 
bution has no correlations among its spins and thus has an energy-variance 
which scales simply linearly with the number of spins. 


> 33.4 Variational methods in inference and data modelling 


In statistical data modelling we are interested in the posterior probability
distribution of a parameter vector w given data D and model assumptions H,

\[ P(w \mid D, \mathcal{H}) = \frac{P(D \mid w, \mathcal{H}) P(w \mid \mathcal{H})}{P(D \mid \mathcal{H})}. \tag{33.33} \]

In traditional approaches to model fitting, a single parameter vector w is
optimized to find the mode of this distribution. What is really of interest is
the whole distribution. We may also be interested in its normalizing constant
P(D | H) if we wish to do model comparison. The probability distribution
P(w | D, H) is often a complex distribution. In a variational approach to
inference, we introduce an approximating probability distribution over the
parameters, Q(w; θ), and optimize this distribution (by varying its own
parameters θ) so that it approximates the posterior distribution of the parameters
P(w | D, H) well.

Figure 33.3. Comparison of approximating distribution’s properties with
those of a real 8 × 8 fragment. Notice that the variational free energy of the
approximating distribution is indeed an upper bound on the free energy of the
real system. All quantities are shown ‘per spin’. (Panel shown: free energy.
Curves: mean field theory; real 8 × 8 system.)

One objective function we may choose to measure the quality of the ap- 
proximation is the variational free energy 


\[ F(\theta) = \int d^k w \; Q(w; \theta) \ln \frac{Q(w; \theta)}{P(D \mid w, \mathcal{H}) P(w \mid \mathcal{H})}. \tag{33.34} \]

The denominator P(D | w, H)P(w | H) is, within a multiplicative constant, the
posterior probability P(w | D, H) = P(D | w, H)P(w | H)/P(D | H). So the
variational free energy F(θ) can be viewed as the sum of −ln P(D | H) and
the relative entropy between Q(w; θ) and P(w | D, H). F(θ) is bounded below
by −ln P(D | H) and only attains this value for Q(w; θ) = P(w | D, H). For
certain models and certain approximating distributions, this free energy, and 
its derivatives with respect to the approximating distribution’s parameters, 
can be evaluated. 
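The decomposition of the free energy into −ln P(D | H) plus a relative entropy can be checked on a toy problem. The following is a minimal sketch in which the continuous parameter vector is replaced, purely for illustration, by a discrete parameter with three values, so that the integral becomes a sum; the prior and likelihood numbers are made up.

```python
import math

def variational_free_energy(q, joint):
    """F = sum_w Q(w) ln[Q(w) / (P(D|w) P(w))] for a discrete parameter w."""
    return sum(qw * math.log(qw / jw) for qw, jw in zip(q, joint) if qw > 0)

def relative_entropy(q, p):
    """Relative entropy between Q and P, in nats."""
    return sum(qw * math.log(qw / pw) for qw, pw in zip(q, p) if qw > 0)

prior = [0.5, 0.3, 0.2]           # P(w | H), made-up numbers
likelihood = [0.10, 0.40, 0.25]   # P(D | w, H), made-up numbers
joint = [l * p for l, p in zip(likelihood, prior)]
evidence = sum(joint)                        # P(D | H)
posterior = [j / evidence for j in joint]    # P(w | D, H)

q = [0.2, 0.5, 0.3]                          # an arbitrary approximating distribution
F = variational_free_energy(q, joint)
print(F, -math.log(evidence) + relative_entropy(q, posterior))    # two equal numbers
print(variational_free_energy(posterior, joint), -math.log(evidence))
```

The first line printed shows the decomposition for an arbitrary Q; the second shows that the bound is attained when Q equals the posterior.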

The approximation of posterior probability distributions using variational 
free energy minimization provides a useful approach to approximating Bayesian 
inference in a number of fields ranging from neural networks to the decoding of 
error-correcting codes (Hinton and van Camp, 1993; Hinton and Zemel, 1994; 
Dayan et al., 1995; Neal and Hinton, 1998; MacKay, 1995a). The method 
is sometimes called ensemble learning to contrast it with traditional learning 
processes in which a single parameter vector is optimized. Another name for 
it is variational Bayes. Let us examine how ensemble learning works in the 
simple case of a Gaussian distribution. 


> 33.5 The case of an unknown Gaussian: approximating the posterior
distribution of μ and σ


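We will fit an approximating ensemble Q(μ, σ) to the posterior distribution
that we studied in Chapter 24,

\[ P(\mu, \sigma \mid \{x_n\}_{n=1}^N) = \frac{P(\{x_n\}_{n=1}^N \mid \mu, \sigma) P(\mu, \sigma)}{P(\{x_n\}_{n=1}^N)} \tag{33.35} \]

\[ = \frac{\frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left( -\frac{N(\mu - \bar{x})^2 + S}{2\sigma^2} \right) \frac{1}{\sigma_\mu} \frac{1}{\sigma}}{P(\{x_n\}_{n=1}^N)}. \tag{33.36} \]

We make the single assumption that the approximating ensemble is separable
in the form Q(μ, σ) = Q_μ(μ)Q_σ(σ). No restrictions on the functional form of
Q_μ(μ) and Q_σ(σ) are made.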

We write down a variational free energy,

\[ F(Q) = \int d\mu \, d\sigma \; Q_\mu(\mu) Q_\sigma(\sigma) \ln \frac{Q_\mu(\mu) Q_\sigma(\sigma)}{P(D \mid \mu, \sigma) P(\mu, \sigma)}. \tag{33.37} \]

We can find the optimal separable distribution Q by considering separately
the optimization of F over Q_μ(μ) for fixed Q_σ(σ), and then the optimization
of Q_σ(σ) for fixed Q_μ(μ).

Optimization of Q_μ(μ)

As a functional of Q_μ(μ), F is:

\[ F = -\int d\mu \; Q_\mu(\mu) \left[ \int d\sigma \; Q_\sigma(\sigma) \ln P(D \mid \mu, \sigma) + \ln [P(\mu) / Q_\mu(\mu)] \right] + \kappa \tag{33.38} \]

\[ = \int d\mu \; Q_\mu(\mu) \left[ \int d\sigma \; Q_\sigma(\sigma) N \beta \tfrac{1}{2} (\mu - \bar{x})^2 + \ln Q_\mu(\mu) \right] + \kappa', \tag{33.39} \]

where β ≡ 1/σ², and κ and κ′ denote constants that do not depend on Q_μ(μ). The
dependence on Q_σ thus collapses down to a simple dependence on the mean

\[ \bar{\beta} \equiv \int d\sigma \; Q_\sigma(\sigma) \, 1/\sigma^2. \tag{33.40} \]

Now we can recognize the function −N β̄ ½ (μ − x̄)² as the logarithm of a
Gaussian identical to the posterior distribution for a particular value of β = β̄.
Since a relative entropy ∫ Q ln(Q/P) is minimized by setting Q = P, we can
immediately write down the distribution Q_μ^opt(μ) that minimizes F for fixed
Q_σ:

\[ Q_\mu^{\rm opt}(\mu) = P(\mu \mid D, \bar{\beta}, \mathcal{H}) = \text{Normal}(\mu; \bar{x}, \sigma^2_{\mu|D}), \tag{33.41} \]

where σ²_{μ|D} = 1/(N β̄).


Optimization of Q_σ(σ)

We represent Q_σ(σ) using the density over β, Q_σ(β) ≡ Q_σ(σ) |dσ/dβ|. As a
functional of Q_σ(β), F is (neglecting additive constants):

\[ F = -\int d\beta \; Q_\sigma(\beta) \left[ \int d\mu \; Q_\mu(\mu) \ln P(D \mid \mu, \sigma) + \ln [P(\beta) / Q_\sigma(\beta)] \right] \tag{33.42} \]

\[ = \int d\beta \; Q_\sigma(\beta) \left[ (N \sigma^2_{\mu|D} + S) \beta / 2 - \left( \frac{N}{2} - 1 \right) \ln \beta + \ln Q_\sigma(\beta) \right], \tag{33.43} \]

where the integral over μ is performed assuming Q_μ(μ) = Q_μ^opt(μ). Here, the
β-dependent expression in square brackets can be recognized as the logarithm of
a gamma distribution over β (see equation (23.15)), giving as the distribution
that minimizes F for fixed Q_μ:

\[ Q_\sigma^{\rm opt}(\beta) = \Gamma(\beta; b', c'), \tag{33.44} \]

with

\[ \frac{1}{b'} = \frac{1}{2} (N \sigma^2_{\mu|D} + S) \quad \text{and} \quad c' = \frac{N}{2}. \tag{33.45} \]

In figure 33.4, these two update rules (33.41, 33.44) are applied alternately,
starting from an arbitrary initial condition. The algorithm converges to the
optimal approximating ensemble in a few iterations.

Figure 33.4. Optimization of an approximating distribution. The posterior
distribution P(μ, σ | {x_n}), which is the same as that in figure 24.1, is shown by
solid contours. (a) Initial condition. The approximating distribution Q(μ, σ)
(dotted contours) is an arbitrary separable distribution. (b) Q_μ has been
updated, using equation (33.41). (c) Q_σ has been updated, using equation
(33.44). (d) Q_μ updated again. (e) Q_σ updated again. (f) Converged
approximation (after 15 iterations). The arrows point to the peaks of the two
distributions, which are at σ_N = 0.45 (for P) and σ_{N−1} = 0.5 (for Q).

[Margin note: The prior P(σ) ∝ 1/σ transforms to P(β) ∝ 1/β.]
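The alternation between the two update rules can be sketched as follows. This is a minimal illustration that tracks only the two numbers the updates exchange: the variance of Q_μ, taken to be 1/(N β̄), and the mean β̄ = b′c′ of the gamma distribution with 1/b′ = (N σ²_{μ|D} + S)/2 and c′ = N/2. The function name, the data summary (N, S) and the starting value are made up for the example.

```python
def alternate_updates(N, S, beta_bar=1.0, iterations=15):
    """Alternate the update of Q_mu and the update of Q_sigma.

    Returns (var_mu, beta_bar): the variance of Q_mu and the mean of the
    gamma distribution over beta = 1/sigma^2.
    """
    var_mu = None
    for _ in range(iterations):
        var_mu = 1.0 / (N * beta_bar)        # update Q_mu: variance 1/(N beta_bar)
        b_prime = 2.0 / (N * var_mu + S)     # update Q_sigma: 1/b' = (N var_mu + S)/2
        c_prime = N / 2.0                    # update Q_sigma: c' = N/2
        beta_bar = b_prime * c_prime         # mean of the gamma distribution is b'c'
    return var_mu, beta_bar

N, S = 5, 1.0                                # hypothetical data summary
var_mu, beta_bar = alternate_updates(N, S, beta_bar=10.0)
print(1.0 / beta_bar, S / (N - 1))           # the two numbers agree after convergence
```

Each pass shrinks the distance of 1/β̄ from its fixed point by a factor of N, which is why a handful of iterations suffices.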


Direct solution for the joint optimum Q_μ(μ)Q_σ(σ)


In this problem, we do not need to resort to iterative computation to find
the optimal approximating ensemble. Equations (33.41) and (33.44) define
the optimum implicitly. We must simultaneously have σ²_{μ|D} = 1/(N β̄), and
β̄ = b′c′. The solution is:

\[ 1/\bar{\beta} = S/(N - 1). \tag{33.46} \]

This is similar to the true posterior distribution of σ, which is a gamma
distribution with c′ = (N − 1)/2 and 1/b′ = S/2 (see equation 24.13). This true posterior
also has a mean value of β satisfying 1/β̄ = S/(N − 1); the only difference is
that the approximating distribution’s parameter c′ is too large by 1/2.





The approximations given by variational free energy minimization
always tend to be more compact than the true distribution.





In conclusion, ensemble learning gives an approximation to the posterior
that agrees nicely with the conventional estimators. The approximate
posterior distribution over β is a gamma distribution with mean β̄ corresponding
to a variance of σ² = S/(N − 1) = σ²_{N−1}. And the approximate posterior
distribution over μ is a Gaussian with mean x̄ and standard deviation σ_{N−1}/√N.

The variational free energy minimization approach has the nice prop- 
erty that it is parameterization-independent; it avoids the problem of basis- 
dependence from which MAP methods and Laplace’s method suffer. 

A convenient software package for automatic implementation of variational 
inference in graphical models is VIBES (Bishop et al., 2002). It plays the same 
role for variational inference as BUGS plays for Monte Carlo inference. 


> 33.6 Interlude 


One of my students asked: 


How do you ever come up with a useful approximating distribution, 
given that the true distribution is so complex you can’t compute 
it directly? 


Let’s answer this question in the context of Bayesian data modelling. Let the
‘true’ distribution of interest be the posterior probability distribution over a
set of parameters x, P(x | D). A standard data modelling practice is to find
a single, ‘best-fit’ setting of the parameters, x*, for example, by finding the
maximum of the likelihood function P(D | x), or of the posterior distribution.
One interpretation of this standard practice is that the full description of
our knowledge about x, P(x | D), is being approximated by a delta-function,
a probability distribution concentrated on x*. From this perspective, any
approximating distribution Q(x; θ), no matter how crummy it is, has to be
an improvement on the spike produced by the standard method! So even if
we use only a simple Gaussian approximation, we are doing well.

We now study an application of the variational approach to a realistic 
example — data clustering. 


> 33.7 K-means clustering and the expectation—maximization algo- 
rithm as a variational method 


In Chapter 20, we introduced the soft K-means clustering algorithm, version 1. 
In Chapter 22, we introduced versions 2 and 3 of this algorithm, and motivated 
the algorithm as a maximum likelihood algorithm. 

K-means clustering is an example of an ‘expectation—maximization’ (EM) 
algorithm, with the two steps, which we called ‘assignment’ and ‘update’, 
being known as the ‘E-step’ and the ‘M-step’ respectively. 

We now give a more general view of K-means clustering, due to Neal 
and Hinton (1998), in which the algorithm is shown to optimize a variational 
objective function. Neal and Hinton’s derivation applies to any EM algorithm. 


The probability of everything 


Let the parameters of the mixture model (the means, standard deviations, and
weights) be denoted by θ. For each data point, there is a missing variable (also
known as a latent variable), the class label k_n for that point. The probability
of everything, given our assumed model H, is

\[ P(\{x^{(n)}, k_n\}_{n=1}^N, \theta \mid \mathcal{H}) = P(\theta \mid \mathcal{H}) \prod_{n=1}^N \left[ P(x^{(n)} \mid k_n, \theta) P(k_n \mid \theta) \right]. \tag{33.47} \]

The posterior probability of everything, given the data, is proportional to the
probability of everything:

\[ P(\{k_n\}_{n=1}^N, \theta \mid \{x^{(n)}\}_{n=1}^N, \mathcal{H}) = \frac{P(\{x^{(n)}, k_n\}_{n=1}^N, \theta \mid \mathcal{H})}{P(\{x^{(n)}\}_{n=1}^N \mid \mathcal{H})}. \tag{33.48} \]

We now approximate this posterior distribution by a separable distribution

\[ Q_k(\{k_n\}_{n=1}^N) \, Q_\theta(\theta), \tag{33.49} \]

and define a variational free energy in the usual way:

\[ F(Q_k, Q_\theta) = \sum_{\{k_n\}} \int d\theta \; Q_k(\{k_n\}_{n=1}^N) Q_\theta(\theta) \ln \frac{Q_k(\{k_n\}_{n=1}^N) Q_\theta(\theta)}{P(\{x^{(n)}, k_n\}_{n=1}^N, \theta \mid \mathcal{H})}. \tag{33.50} \]

F is bounded below by minus the log evidence, −ln P({x^(n)}_{n=1}^N | H). We can now
make an iterative algorithm with an ‘assignment’ step and an ‘update’ step.
In the assignment step, Q_k({k_n}_{n=1}^N) is adjusted to reduce F, for fixed Q_θ; in
the update step, Q_θ is adjusted to reduce F, for fixed Q_k.

If we wish to obtain exactly the soft K-means algorithm, we impose a
further constraint on our approximating distribution: Q_θ is constrained to be
a delta function centred on a point estimate of θ, θ = θ*:

\[ Q_\theta(\theta) = \delta(\theta - \theta^*). \tag{33.51} \]

Unfortunately, this distribution contributes to the variational free energy an
infinitely large integral ∫ dθ Q_θ(θ) ln Q_θ(θ), so we’d better leave that term
out of F, treating it as an additive constant. [Using a delta function Q_θ is not
a good idea if our aim is to minimize F!] Moving on, our aim is to derive the
soft K-means algorithm.

[Figure 33.5, graph annotations: upper bound and lower bound on the function
1/(1 + e^{−a}), where λ(ν) = [g(ν) − 1/2]/2ν.]
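The claim that both steps reduce F can be checked numerically. The sketch below is a deliberately stripped-down, one-dimensional illustration, not the full algorithm of Chapters 20 and 22: it assumes K clusters with equal weights 1/K and a fixed, shared standard deviation, so that θ* consists of the means only, and it omits the infinite delta-function entropy term and the constant prior term from F. All function names and the data are hypothetical.

```python
import math

def log_gauss(x, m, sigma):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * ((x - m) / sigma) ** 2

def assignment_step(xs, means, sigma):
    """Set each row of responsibilities proportional to the cluster likelihoods."""
    resp = []
    for x in xs:
        logs = [log_gauss(x, m, sigma) for m in means]
        top = max(logs)                           # subtract the max for stability
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        resp.append([w / total for w in weights])
    return resp

def update_step(xs, resp, K):
    """Move each mean to the responsibility-weighted average of the data."""
    return [sum(r[k] * x for r, x in zip(resp, xs)) / sum(r[k] for r in resp)
            for k in range(K)]

def free_energy(xs, resp, means, sigma):
    """F for separable Q_k, with the delta-function entropy term left out."""
    K = len(means)
    total = 0.0
    for x, r in zip(xs, resp):
        for k in range(K):
            if r[k] > 0.0:
                log_joint = math.log(1.0 / K) + log_gauss(x, means[k], sigma)
                total += r[k] * (math.log(r[k]) - log_joint)
    return total

def run_soft_kmeans(xs, means, sigma=1.0, iterations=10):
    """Alternate the two steps, recording F after every step."""
    history = []
    for _ in range(iterations):
        resp = assignment_step(xs, means, sigma)
        history.append(free_energy(xs, resp, means, sigma))
        means = update_step(xs, resp, len(means))
        history.append(free_energy(xs, resp, means, sigma))
    return means, history

xs = [-2.1, -1.9, -2.0, 1.8, 2.2, 2.0]
means, history = run_soft_kmeans(xs, means=[-0.5, 0.5])
print(means)
print(all(b <= a + 1e-12 for a, b in zip(history, history[1:])))
```

The recorded values of F never increase, as expected of a coordinate-wise minimization.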


Exercise 33.2. Show that, given Q_θ(θ) = δ(θ − θ*), the optimal Q_k, in the
sense of minimizing F, is a separable distribution in which the probability
that k_n = k is given by the responsibility r_k^(n).


Exercise 33.3.[1] Show that, given a separable Q_k as described above, the
optimal θ*, in the sense of minimizing F, is obtained by the update step
of the soft K-means algorithm. (Assume a uniform prior on θ.)


Exercise 33.4.[4] We can instantly improve on the infinitely large value of F
achieved by soft K-means clustering by allowing Q_θ to be a more general
distribution than a delta-function. Derive an update step in which Q_θ is
allowed to be a separable distribution, a product of Q_μ({μ}), Q_σ({σ}),
and Q_π(π). Discuss whether this generalized algorithm still suffers from
soft K-means’s ‘kaboom’ problem, where the algorithm glues an ever-shrinking
Gaussian to one data point.


Sadly, while it sounds like a promising generalization of the algorithm
to allow Q_θ to be a non-delta-function, and the ‘kaboom’ problem goes
away, other artefacts can arise in this approximate inference method,
involving local minima of F. For further reading, see (MacKay, 1997a;
MacKay, 2001).


33.8 Variational methods other than free energy minimization 


There are other strategies for approximating a complicated distribution P(x),
in addition to those based on minimizing the relative entropy between an
approximating distribution, Q, and P. One approach pioneered by Jaakkola
and Jordan is to create adjustable upper and lower bounds Q^U and Q^L to P,
as illustrated in figure 33.5. These bounds (which are unnormalized densities)
are parameterized by variational parameters which are adjusted in order to
obtain the tightest possible fit. The lower bound can be adjusted to maximize

\[ \sum_x Q^L(x), \tag{33.52} \]

and the upper bound can be adjusted to minimize

\[ \sum_x Q^U(x). \tag{33.53} \]

Using the normalized versions of the optimized bounds we then compute
approximations to the predictive distributions. Further reading on such methods
can be found in the references (Jaakkola and Jordan, 2000a; Jaakkola and
Jordan, 2000b; Jaakkola and Jordan, 1996; Gibbs and MacKay, 2000).

Figure 33.5. Illustration of the Jaakkola-Jordan variational method. Upper
and lower bounds on the logistic function (solid line)

\[ g(a) \equiv \frac{1}{1 + e^{-a}}. \]

Upper bound:

\[ g(a) \le \exp\left( \mu a - H_2^{e}(\mu) \right), \qquad \mu \in [0, 1]. \]

Lower bound:

\[ g(a) \ge g(\nu) \exp\left[ (a - \nu)/2 - \lambda(\nu)(a^2 - \nu^2) \right]. \]

These upper and lower bounds are exponential or Gaussian functions of a, and
so easier to integrate over. The graph shows the sigmoid function and upper
and lower bounds with μ = 0.505 and ν = −2.015.
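The two bounds of figure 33.5 can be verified numerically. The sketch below assumes the forms g(a) ≤ exp(μa − H(μ)), with H the binary entropy in nats, and g(a) ≥ g(ν) exp[(a − ν)/2 − λ(ν)(a² − ν²)], with λ(ν) = [g(ν) − 1/2]/2ν; the function names are illustrative, and the check is a spot check on a grid, not a proof.

```python
import math

def g(a):
    """The logistic function."""
    return 1.0 / (1.0 + math.exp(-a))

def binary_entropy_nats(mu):
    return -mu * math.log(mu) - (1.0 - mu) * math.log(1.0 - mu)

def upper_bound(a, mu):
    """exp(mu*a - H(mu)): an exponential function of a; requires 0 < mu < 1."""
    return math.exp(mu * a - binary_entropy_nats(mu))

def lower_bound(a, nu):
    """g(nu) exp[(a - nu)/2 - lambda(nu)(a^2 - nu^2)]: Gaussian in a; nu != 0."""
    lam = (g(nu) - 0.5) / (2.0 * nu)
    return g(nu) * math.exp((a - nu) / 2.0 - lam * (a * a - nu * nu))

mu, nu = 0.505, -2.015    # the values quoted for the graph in figure 33.5
for a in (-4.0, -2.015, 0.0, 2.0, 4.0):
    print(f"a = {a:+.3f}  lower = {lower_bound(a, nu):.4f}  "
          f"g = {g(a):.4f}  upper = {upper_bound(a, mu):.4f}")
```

The lower bound touches g at a = ±ν, and the upper bound touches it where the slope of ln g equals μ.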


Further reading 


The Bethe and Kikuchi free energies 


In Chapter 26 we discussed the sum-product algorithm for functions of the
factor-graph form (26.1). If the factor graph is tree-like, the sum-product
algorithm converges and correctly computes the marginal function of any variable
x_n and can also yield the joint marginal function of subsets of variables that
appear in a common factor, such as x_m.

The sum-product algorithm may also be applied to factor graphs that are
not tree-like. If the algorithm converges to a fixed point, it has been shown
that that fixed point is a stationary point (usually a minimum) of a function
of the messages called the Kikuchi free energy. In the special case where all
factors in the factor graph are functions of one or two variables, the Kikuchi free
energy is called the Bethe free energy.

For articles on this idea, and new approximate inference algorithms mo- 
tivated by it, see Yedidia (2000); Yedidia et al. (2000); Welling and Teh 
(2001); Yuille (2001); Yedidia et al. (2001b); Yedidia et al. (2001a). 


> 33.9 Further exercises 


Exercise 33.5.[p.435] This exercise explores the assertion, made above, that
the approximations given by variational free energy minimization always
tend to be more compact than the true distribution. Consider a
two dimensional Gaussian distribution P(x) with axes aligned with the
directions e^(1) = (1, 1) and e^(2) = (1, −1). Let the variances in these two
directions be σ_1² and σ_2². What is the optimal variance if this distribution
is approximated by a spherical Gaussian with variance σ_Q², optimized by
variational free energy minimization? If we instead optimized the objective
function

\[ G = \int dx \; P(x) \ln \frac{P(x)}{Q(x; \sigma^2)}, \tag{33.54} \]

what would be the optimal value of σ²? Sketch a contour of the true
distribution P(x) and the two approximating distributions in the case
σ_1/σ_2 = 10.


[Note that in general it is not possible to evaluate the objective function
G, because integrals under the true distribution P(x) are usually
intractable.]


Exercise 33.6.[p.436] What do you think of the idea of using a variational
method to optimize an approximating distribution Q which we then use
as a proposal density for importance sampling?


Exercise 33.7. Define the relative entropy or Kullback-Leibler divergence
between two probability distributions P and Q, and state Gibbs’ inequality.


Consider the problem of approximating a joint distribution P(x, y) by a
separable distribution Q(x, y) = Q_X(x)Q_Y(y). Show that if the objective
function for this approximation is

\[ G(Q_X, Q_Y) = \sum_{x,y} P(x, y) \log_2 \frac{P(x, y)}{Q_X(x) Q_Y(y)} \]

then the minimal value of G is achieved when Q_X and Q_Y are equal to
the marginal distributions over x and y.


Now consider the alternative objective function

\[ F(Q_X, Q_Y) = \sum_{x,y} Q_X(x) Q_Y(y) \log_2 \frac{Q_X(x) Q_Y(y)}{P(x, y)}; \]

the probability distribution P(x, y) shown in the table below is to be
approximated by a separable distribution Q(x, y) = Q_X(x)Q_Y(y). State
the value of F(Q_X, Q_Y) if Q_X and Q_Y are set to the marginal
distributions over x and y.

    P(x, y)         x
                1     2     3     4
      y   1    1/8   1/8    0     0
          2    1/8   1/8    0     0
          3     0     0    1/4    0
          4     0     0     0    1/4

Show that F(Q_X, Q_Y) has three distinct minima, identify those minima,
and evaluate F at each of them.


> 33.10 Solutions 


Solution to exercise 33.5 (p.434). We need to know the relative entropy
between two one-dimensional Gaussian distributions:

\[ \int dx \; \text{Normal}(x; 0, \sigma_Q) \ln \frac{\text{Normal}(x; 0, \sigma_Q)}{\text{Normal}(x; 0, \sigma_P)} \]

\[ = \int dx \; \text{Normal}(x; 0, \sigma_Q) \left[ \ln \frac{\sigma_P}{\sigma_Q} - \frac{1}{2} x^2 \left( \frac{1}{\sigma_Q^2} - \frac{1}{\sigma_P^2} \right) \right] \tag{33.55} \]

\[ = \frac{1}{2} \left( \ln \frac{\sigma_P^2}{\sigma_Q^2} - 1 + \frac{\sigma_Q^2}{\sigma_P^2} \right). \tag{33.56} \]

So, if we approximate P, whose variances are σ_1² and σ_2², by Q, whose variances
are both σ_Q², we find

\[ F(\sigma_Q^2) = \frac{1}{2} \left( \ln \frac{\sigma_1^2}{\sigma_Q^2} - 1 + \frac{\sigma_Q^2}{\sigma_1^2} + \ln \frac{\sigma_2^2}{\sigma_Q^2} - 1 + \frac{\sigma_Q^2}{\sigma_2^2} \right); \tag{33.57} \]

differentiating,

\[ \frac{d}{d \ln(\sigma_Q^2)} F = \frac{1}{2} \left[ -2 + \left( \frac{\sigma_Q^2}{\sigma_1^2} + \frac{\sigma_Q^2}{\sigma_2^2} \right) \right], \tag{33.58} \]

which is zero when

\[ \frac{1}{\sigma_Q^2} = \frac{1}{2} \left( \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} \right). \tag{33.59} \]

Thus we set the approximating distribution’s inverse variance to the mean
inverse variance of the target distribution P.

In the case σ_1 = 10 and σ_2 = 1, we obtain σ_Q ≃ √2, which is just a factor
of √2 larger than σ_2, pretty much independent of the value of the larger
standard deviation σ_1. Variational free energy minimization typically leads to
approximating distributions whose length scales match the shortest length scale
of the target distribution. The approximating distribution might be viewed as
too compact.

If we instead use the objective function G, we find

\[ G(\sigma_Q^2) = \frac{1}{2} \left( \ln \sigma_Q^2 + \frac{\sigma_1^2}{\sigma_Q^2} + \ln \sigma_Q^2 + \frac{\sigma_2^2}{\sigma_Q^2} \right) + \text{constant}, \tag{33.60} \]

where the constant depends on σ_1 and σ_2 only. Differentiating,

\[ \frac{d}{d \ln \sigma_Q^2} G = \frac{1}{2} \left[ 2 - \left( \frac{\sigma_1^2}{\sigma_Q^2} + \frac{\sigma_2^2}{\sigma_Q^2} \right) \right], \tag{33.61} \]

which is zero when

\[ \sigma_Q^2 = \frac{1}{2} \left( \sigma_1^2 + \sigma_2^2 \right). \tag{33.62} \]

Thus we set the approximating distribution’s variance to the mean variance
of the target distribution P.

In the case σ_1 = 10 and σ_2 = 1, we obtain σ_Q ≃ 10/√2, which is just a
factor of √2 smaller than σ_1, independent of the value of σ_2.

The two approximations are shown to scale in figure 33.6.

Figure 33.6. Two separable Gaussian approximations (dotted lines) to a
bivariate Gaussian distribution (solid line). (a) The approximation that
minimizes the variational free energy. (b) The approximation that minimizes
the objective function G. In each figure, the lines show the contours at which
xᵀAx = 1, where A is the inverse covariance matrix of the Gaussian.
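The two optima can be confirmed numerically. The following sketch evaluates the two objectives as sums of one-dimensional Gaussian relative entropies, ½(ln(σ_P²/σ_Q²) − 1 + σ_Q²/σ_P²) for F and the same expression with the roles of P and Q exchanged for G, and compares the closed-form optima with nearby values. The function names are illustrative.

```python
import math

def free_energy_F(var_q, var_1, var_2):
    """Sum over the two directions of the relative entropy from Q to P."""
    return 0.5 * sum(math.log(v / var_q) - 1.0 + var_q / v for v in (var_1, var_2))

def objective_G(var_q, var_1, var_2):
    """Sum over the two directions of the relative entropy from P to Q."""
    return 0.5 * sum(math.log(var_q / v) - 1.0 + v / var_q for v in (var_1, var_2))

def optimal_variance_F(var_1, var_2):
    """Inverse variance of Q equals the mean inverse variance of P."""
    return 1.0 / (0.5 * (1.0 / var_1 + 1.0 / var_2))

def optimal_variance_G(var_1, var_2):
    """Variance of Q equals the mean variance of P."""
    return 0.5 * (var_1 + var_2)

var_1, var_2 = 10.0 ** 2, 1.0 ** 2
print("F-optimal sigma_Q:", math.sqrt(optimal_variance_F(var_1, var_2)))
print("G-optimal sigma_Q:", math.sqrt(optimal_variance_G(var_1, var_2)))
```

For σ_1 = 10 and σ_2 = 1 the first standard deviation is close to √2 and the second is close to 10/√2, a factor of about five apart.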


Solution to exercise 33.6 (p.434). The best possible variational approximation
is of course the target distribution P. Assuming that this is not possible, a
good variational approximation is more compact than the true distribution.
In contrast, a good sampler is more heavy tailed than the true distribution.
An over-compact distribution would be a lousy sampler with a large variance.


34


Independent Component Analysis and 
Latent Variable Modelling 


> 34.1 Latent variable models 


Many statistical models are generative models (that is, models that specify
a full probability density over all variables in the situation) that make use of
latent variables to describe a probability distribution over observables.

Examples of latent variable models include Chapter 22’s mixture models,
which model the observables as coming from a superposed mixture of simple
probability distributions (the latent variables are the unknown class labels
of the examples); hidden Markov models (Rabiner and Juang, 1986; Durbin
et al., 1998); and factor analysis.

The decoding problem for error-correcting codes can also be viewed in
terms of a latent variable model (figure 34.1). In that case, the encoding
matrix G is normally known in advance. In latent variable modelling, the
parameters equivalent to G are usually not known, and must be inferred from
the data along with the latent variables s.

Usually, the latent variables have a simple distribution, often a separable
distribution. Thus when we fit a latent variable model, we are finding a
description of the data in terms of ‘independent components’. The ‘independent
component analysis’ algorithm corresponds to perhaps the simplest possible
latent variable model with continuous latent variables.

Figure 34.1. Error-correcting codes as latent variable models. The K latent
variables are the independent source bits s_1, ..., s_K; these give rise to the
observables via the generator matrix G.


> 34.2 The generative model for independent component analysis 


A set of N observations D = {x^(n)}_{n=1}^N are assumed to be generated as follows.
Each J-dimensional vector x is a linear mixture of I underlying source signals,
s:

\[ \mathbf{x} = \mathbf{G} \mathbf{s}, \tag{34.1} \]

where the matrix of mixing coefficients G is not known.

The simplest algorithm results if we assume that the number of sources
is equal to the dimensionality of the observations, i.e., I = J. Our aim is to
recover the source variables s (within some multiplicative factors, and possibly
permuted). To put it another way, we aim to create the inverse of G (within a
post-multiplicative factor) given only a set of examples {x}. We assume that
the latent variables are independently distributed, with marginal distributions
P(s_i | H) ≡ p_i(s_i). Here H denotes the assumed form of this model and the
assumed probability distributions p_i of the latent variables.

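The probability of the observables and the hidden variables, given G and
H, is

\[ P(\{x^{(n)}, s^{(n)}\}_{n=1}^N \mid \mathbf{G}, \mathcal{H}) = \prod_{n=1}^N \left[ P(x^{(n)} \mid s^{(n)}, \mathbf{G}, \mathcal{H}) P(s^{(n)} \mid \mathcal{H}) \right] \tag{34.2} \]

\[ = \prod_{n=1}^N \left[ \left( \prod_j \delta\Big( x_j^{(n)} - \sum_i G_{ji} s_i^{(n)} \Big) \right) \left( \prod_i p_i(s_i^{(n)}) \right) \right]. \tag{34.3} \]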


We assume that the vector x is generated without noise. This assumption is 
not usually made in latent variable modelling, since noise-free data are rare; 
but it makes the inference problem far simpler to solve. 


The likelihood function 


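For learning about G from the data D, the relevant quantity is the likelihood
function

\[ P(D \mid \mathbf{G}, \mathcal{H}) = \prod_n P(x^{(n)} \mid \mathbf{G}, \mathcal{H}) \tag{34.4} \]

which is a product of factors each of which is obtained by marginalizing over
the latent variables. When we marginalize over delta functions, remember
that ∫ ds δ(x − vs) f(s) = (1/|v|) f(x/v). We adopt summation convention at this
point, such that, for example, G_ji s_i^(n) ≡ Σ_i G_ji s_i^(n). A single factor in the
likelihood is given by

\[ P(x^{(n)} \mid \mathbf{G}, \mathcal{H}) = \int d^I s^{(n)} \; P(x^{(n)} \mid s^{(n)}, \mathbf{G}, \mathcal{H}) P(s^{(n)} \mid \mathcal{H}) \tag{34.5} \]

\[ = \int d^I s^{(n)} \; \prod_j \delta\left( x_j^{(n)} - G_{ji} s_i^{(n)} \right) \prod_i p_i(s_i^{(n)}) \tag{34.6} \]

\[ = \frac{1}{|\det \mathbf{G}|} \prod_i p_i(G^{-1}_{ij} x_j) \tag{34.7} \]

\[ \Rightarrow \ln P(x^{(n)} \mid \mathbf{G}, \mathcal{H}) = -\ln |\det \mathbf{G}| + \sum_i \ln p_i(G^{-1}_{ij} x_j). \tag{34.8} \]

To obtain a maximum likelihood algorithm we find the gradient of the log
likelihood. If we introduce W ≡ G^{-1}, the log likelihood contributed by a
single example may be written:

\[ \ln P(x^{(n)} \mid \mathbf{G}, \mathcal{H}) = \ln |\det \mathbf{W}| + \sum_i \ln p_i(W_{ij} x_j). \tag{34.9} \]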


We'll assume from now on that det W is positive, so that we can omit the
absolute value sign. We will need the following identities:

\[ \frac{\partial}{\partial G_{ji}} \ln \det \mathbf{G} = G^{-1}_{ij} = W_{ij} \tag{34.10} \]

\[ \frac{\partial}{\partial G_{ji}} G^{-1}_{lm} = -G^{-1}_{lj} G^{-1}_{im} = -W_{lj} W_{im} \tag{34.11} \]

\[ \frac{\partial}{\partial W_{ij}} f = -G_{jm} \left( \frac{\partial}{\partial G_{lm}} f \right) G_{li}. \tag{34.12} \]

Let us define a_i ≡ W_ij x_j,

\[ \phi_i(a_i) \equiv d \ln p_i(a_i) / d a_i, \tag{34.13} \]

and z_i = φ_i(a_i), which indicates in which direction a_i needs to change to make
the probability of the data greater. We may then obtain the gradient with
respect to G_ji using equations (34.10) and (34.11):

\[ \frac{\partial}{\partial G_{ji}} \ln P(x^{(n)} \mid \mathbf{G}, \mathcal{H}) = -W_{ij} - a_i z_{i'} W_{i'j}. \tag{34.14} \]

Or alternatively, the derivative with respect to W_ij:

\[ \frac{\partial}{\partial W_{ij}} \ln P(x^{(n)} \mid \mathbf{G}, \mathcal{H}) = G_{ji} + x_j z_i. \tag{34.15} \]

If we choose to change W so as to ascend this gradient, we obtain the learning
rule

\[ \Delta \mathbf{W} \propto [\mathbf{W}^{\mathsf{T}}]^{-1} + \mathbf{z} \mathbf{x}^{\mathsf{T}}. \tag{34.16} \]

The algorithm so far is summarized in algorithm 34.2.

Algorithm 34.2. Independent component analysis: online steepest ascents
version. See also algorithm 34.4, which is to be preferred.

Repeat for each datapoint x:

1. Put x through a linear mapping:
       a = Wx.
2. Put a through a nonlinear map:
       z_i = φ_i(a_i),
   where a popular choice for φ is φ = −tanh(a_i).
3. Adjust the weights in accordance with
       ΔW ∝ [Wᵀ]^{-1} + z xᵀ.
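The three steps of algorithm 34.2 can be sketched in plain Python for the two-dimensional case. This is a minimal illustration under stated assumptions: the latent prior is taken to be p(s) = 1/(π cosh s), so that φ(a) = −tanh(a); the matrices are 2 × 2 lists so that the inverse can be written out by hand; and the learning rate eta and all function names are hypothetical.

```python
import math

def linear_map(W, x):
    """Step 1: a = W x."""
    return [W[i][0] * x[0] + W[i][1] * x[1] for i in range(2)]

def log_likelihood(W, x):
    """ln|det W| + sum_i ln p(a_i), assuming p(s) = 1/(pi cosh s)."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    a = linear_map(W, x)
    return math.log(abs(det)) + sum(-math.log(math.pi * math.cosh(ai)) for ai in a)

def gradient(W, x):
    """Entries G_ji + z_i x_j, i.e. the matrix [W^T]^{-1} + z x^T."""
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    G = [[W[1][1] / det, -W[0][1] / det],       # G is the inverse of W
         [-W[1][0] / det, W[0][0] / det]]
    z = [-math.tanh(ai) for ai in linear_map(W, x)]   # step 2: z_i = phi(a_i)
    return [[G[j][i] + z[i] * x[j] for j in range(2)] for i in range(2)]

def online_step(W, x, eta=0.01):
    """Step 3: move W a small distance up the gradient."""
    g = gradient(W, x)
    return [[W[i][j] + eta * g[i][j] for j in range(2)] for i in range(2)]

W = [[1.0, 0.3], [-0.2, 0.8]]
x = [0.5, -1.2]
print(log_likelihood(W, x), log_likelihood(online_step(W, x), x))
```

A small step along the gradient raises the log likelihood of the datapoint that produced it, which is all that the online rule promises.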


Choices of φ


The choice of the function φ defines the assumed prior distribution of the
latent variable s.

Let’s first consider the linear choice φ_i(a_i) = −κ a_i, which implicitly (via
equation 34.13) assumes a Gaussian distribution on the latent variables. The 
Gaussian distribution on the latent variables is invariant under rotation of the 
latent variables, so there can be no evidence favouring any particular alignment 
of the latent variable space. The linear algorithm is thus uninteresting in that 
it will never recover the matrix G or the original sources. Our only hope is 
thus that the sources are non-Gaussian. Thankfully, most real sources have 
non-Gaussian distributions; often they have heavier tails than Gaussians. 

We thus move on to the popular tanh nonlinearity. If

\[ \phi_i(a_i) = -\tanh(a_i) \tag{34.17} \]

then implicitly we are assuming

\[ p_i(s_i) \propto 1/\cosh(s_i) \propto \frac{1}{e^{s_i} + e^{-s_i}}. \tag{34.18} \]

This is a heavier-tailed distribution for the latent variables than the Gaussian
distribution.

Figure 34.3. Illustration of the generative models implicit in the learning
algorithm. (a) Distributions over two observables generated by 1/cosh
distributions on the latent variables, for G = [3/4 1/2; 1/2 1] (compact
distribution) and for a second matrix G (broader distribution).
(b) Contours of the generative distributions when the latent variables have
Cauchy distributions. The learning algorithm fits this amoeboid object to the
empirical data in such a way as to maximize the likelihood. The contour plot in
(b) does not adequately represent this heavy-tailed distribution. (c) Part of
the tails of the Cauchy distribution, giving the contours 0.01...0.1 times the
density at the origin. (d) Some data from one of the generative distributions
illustrated in (b) and (c). Can you tell which? 200 samples were created, of
which 196 fell in the plotted region.

We could also use a tanh nonlinearity with gain β, that is, φ_i(a_i) =
−tanh(β a_i), whose implicit probabilistic model is p_i(s_i) ∝ 1/[cosh(β s_i)]^{1/β}. In
the limit of large β, the nonlinearity becomes a step function and the
probability distribution p_i(s_i) becomes a biexponential distribution, p_i(s_i) ∝ exp(−|s|).
In the limit β → 0, p_i(s_i) approaches a Gaussian with mean zero and variance
1/β. Heavier-tailed distributions than these may also be used. The Student
and Cauchy distributions spring to mind.
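The correspondence between the nonlinearity and the implicit prior, φ(s) = d ln p(s)/ds, can be checked by numerical differentiation. The sketch below uses the unnormalized density 1/[cosh(βs)]^{1/β}; the function names and the particular test values are illustrative.

```python
import math

def log_prior_unnormalized(s, gain):
    """ln of 1/[cosh(gain*s)]^(1/gain), up to an additive constant."""
    return -math.log(math.cosh(gain * s)) / gain

def phi(a, gain):
    """tanh nonlinearity with the given gain."""
    return -math.tanh(gain * a)

def numerical_derivative(f, s, eps=1e-6):
    return (f(s + eps) - f(s - eps)) / (2.0 * eps)

for gain in (0.1, 1.0, 5.0):
    s = 0.7
    slope = numerical_derivative(lambda t: log_prior_unnormalized(t, gain), s)
    print(f"gain = {gain}: d ln p/ds = {slope:+.6f}, phi = {phi(s, gain):+.6f}")

# The two limits: roughly -|s| for large gain, roughly -gain*s^2/2 for small gain.
print(log_prior_unnormalized(2.0, 50.0), log_prior_unnormalized(2.0, 0.01))
```

The last line illustrates the biexponential and Gaussian limits at s = 2: the values are close to −2 and to −0.02 respectively.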


Example distributions 


Figures 34.3(a-c) illustrate typical distributions generated by the independent 
components model when the components have 1/cosh and Cauchy distribu- 
tions. Figure 34.3d shows some samples from the Cauchy model. The Cauchy 
distribution, being the more heavy-tailed, gives the clearest picture of how the 
predictive distribution depends on the assumed generative parameters G. 


> 34.3 A covariant, simpler, and faster learning algorithm 


We have thus derived a learning algorithm that performs steepest ascents
on the log likelihood. The algorithm does not work very quickly, even
on toy data; the algorithm is ill-conditioned and illustrates nicely the general
advice that, while finding the gradient of an objective function is a splendid
idea, ascending the gradient directly may not be. The ill-conditioning is
evident from the fact that the algorithm involves a matrix inverse, which
can be arbitrarily large or even undefined.


Covariant optimization in general 


The principle of covariance says that a consistent algorithm should give the
same results independent of the units in which quantities are measured (Knuth,
1968). A prime example of a non-covariant algorithm is the popular steepest
descents rule. A dimensionless objective function L(w) is defined, its
derivative with respect to some parameters w is computed, and then w is changed
by the rule

\[ \Delta w_i = \eta \frac{\partial L}{\partial w_i}. \tag{34.19} \]

This popular equation is dimensionally inconsistent: the left-hand side of this
equation has dimensions of [w_i] and the right-hand side has dimensions 1/[w_i].
The behaviour of the learning algorithm (34.19) is not covariant with respect
to linear rescaling of the vector w. Dimensional inconsistency is not the end of
the world, as the success of numerous gradient descent algorithms has
demonstrated, and indeed if η decreases with n (during on-line learning) as 1/n,
where n is the number of iterations, then
the Munro-Robbins theorem (Bishop, 1992, p. 41) shows that the parameters
will asymptotically converge to the maximum likelihood parameters. But the
non-covariant algorithm may take a very large number of iterations to achieve
this convergence; indeed many former users of steepest descents algorithms
prefer to use algorithms such as conjugate gradients that adaptively figure
out the curvature of the objective function. The defense of equation (34.19)
that points out η could be a dimensional constant is untenable if not all the
parameters w_i have the same dimensions.


The algorithm would be covariant if it had the form 


\[ \Delta w_i = \eta \sum_{i'} M_{ii'} \frac{\partial L}{\partial w_{i'}}, \tag{34.20} \]

where $M$ is a positive-definite matrix whose $i, i'$ element has dimensions $[w_i w_{i'}]$.
From where can we obtain such a matrix? Two sources of such matrices are 
metrics and curvatures. 


Metrics and curvatures 


If there is a natural metric that defines distances in our parameter space w, 
then a matrix M can be obtained from the metric. There is often a natural 
choice. In the special case where there is a known quadratic metric defining 
the length of a vector w, then the matrix can be obtained from the quadratic 
form. For example if the length is $\mathbf{w}^2$ then the natural matrix is $M = I$, and
steepest descents is appropriate. 

Another way of finding a metric is to look at the curvature of the objective 
function, defining $A \equiv -\nabla\nabla L$ (where $\nabla \equiv \partial/\partial\mathbf{w}$). Then the matrix $M = A^{-1}$
will give a covariant algorithm; what is more, this algorithm is the Newton
algorithm, so we recognize that it will alleviate one of the principal difficulties 
with steepest descents, namely its slow convergence to a minimum when the 
objective function is at all ill-conditioned. The Newton algorithm converges 
to the minimum in a single step if L is quadratic. 
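The contrast between rules (34.19) and (34.20) can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-parameter quadratic objective (the matrix `A`, the optimum `w_star`, the unit change `S` and the step size are all illustrative choices, not taken from the text). It takes one step in the original units and one step in rescaled units, then maps the rescaled result back: the steepest-descents results disagree, while the Newton results agree and land on the optimum in a single step.

```python
import numpy as np

def gradient(w, A, w_star):
    """Gradient of the toy objective L(w) = -0.5 (w - w*)^T A (w - w*)."""
    return -A @ (w - w_star)

def steepest_step(w, A, w_star, eta):
    """Rule (34.19): move along the raw gradient."""
    return w + eta * gradient(w, A, w_star)

def newton_step(w, A, w_star):
    """Rule (34.20) with eta = 1 and M = A^{-1}, where A = -grad grad L."""
    return w + np.linalg.solve(A, gradient(w, A, w_star))

# Hypothetical toy problem (all numbers are illustrative).
A = np.array([[2.0, 0.3], [0.3, 0.5]])
w_star = np.array([1.0, -2.0])
w0 = np.zeros(2)

# Change of units w_tilde = S w; the same objective then has curvature S^-1 A S^-1.
S = np.diag([10.0, 0.5])
S_inv = np.linalg.inv(S)
A_tilde = S_inv @ A @ S_inv

def in_original_units(w_tilde):
    return S_inv @ w_tilde

sd_plain = steepest_step(w0, A, w_star, eta=0.1)
sd_rescaled = in_original_units(steepest_step(S @ w0, A_tilde, S @ w_star, eta=0.1))
nt_plain = newton_step(w0, A, w_star)
nt_rescaled = in_original_units(newton_step(S @ w0, A_tilde, S @ w_star))

print("steepest:", sd_plain, "vs", sd_rescaled)
print("newton:  ", nt_plain, "vs", nt_rescaled)
```

The toy objective is written as a function to be maximized, matching the sign convention $A \equiv -\nabla\nabla L$; the same comparison holds for a minimization problem with the signs reversed.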

In some problems it may be that the curvature A consists of both data- 
dependent terms and data-independent terms; in this case, one might choose 
to define the metric using the data-independent terms only (Gull, 1989). The 
resulting algorithm will still be covariant but it will not implement an exact 
Newton step. Obviously there are many covariant algorithms; there is no 
unique choice. But covariant algorithms are a small subset of the set of all 
algorithms!


Back to independent component analysis


For the present maximum likelihood problem we have evaluated the gradient 
with respect to $G$ and the gradient with respect to $W = G^{-1}$. Steepest
ascents in W is not covariant. Let us construct an alternative, covariant 
algorithm with the help of the curvature of the log likelihood. Taking the 
second derivative of the log likelihood with respect to W we obtain two terms, 
the first of which is data-independent: 


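\[ \frac{\partial G_{ji}}{\partial W_{kl}} = -G_{jk} G_{li}, \tag{34.21} \]

and the second of which is data-dependent:

\[ \frac{\partial (z_i x_j)}{\partial W_{kl}} = x_j x_l \delta_{ik} z'_i \quad \text{(no sum over $i$)} \tag{34.22} \]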


where $z'$ is the derivative of $z$. It is tempting to drop the data-dependent term
and define the matrix $M$ by $[M^{-1}]_{(ij)(kl)} = [G_{jk} G_{li}]$. However, this matrix
is not positive definite (it has at least one non-positive eigenvalue), so it is
a poor approximation to the curvature of the log likelihood, which must be
positive definite in the neighbourhood of a maximum likelihood solution. We
must therefore consult the data-dependent term for inspiration. The aim is
to find a convenient approximation to the curvature and to obtain a covariant
algorithm, not necessarily to implement an exact Newton step. What is the
average value of $x_j x_l \delta_{ik} z'_i$? If the true value of $G$ is $G^*$, then

\[ \langle x_j x_l \delta_{ik} z'_i \rangle = \langle G^*_{jm} s_m s_n G^*_{ln} \delta_{ik} z'_i \rangle. \tag{34.23} \]


We now make several severe approximations: we replace $G^*$ by the present
value of $G$, and replace the correlated average $\langle s_m s_n z'_i \rangle$ by
$\langle s_m s_n \rangle \langle z'_i \rangle \equiv \Sigma_{mn} D_i$. Here $\Sigma$ is the variance-covariance matrix of the latent variables
(which is assumed to exist), and $D_i$ is the typical value of the curvature
$d^2 \ln p_i(a)/da^2$. Given that the sources are assumed to be independent, $\Sigma$
and $D$ are both diagonal matrices. These approximations motivate the matrix $M$ given by:

\[ [M^{-1}]_{(ij)(kl)} = G_{jm} \Sigma_{mn} G_{ln} \delta_{ik} D_i, \tag{34.24} \]

that is,

\[ M_{(ij)(kl)} = W_{mj} \Sigma^{-1}_{mn} W_{nl} \delta_{ik} D_i^{-1}. \tag{34.25} \]

For simplicity, we further assume that the sources are similar to each other so
that $\Sigma$ and $D$ are both homogeneous, and that $\Sigma D = 1$. This will lead us to
an algorithm that is covariant with respect to linear rescaling of the data $\mathbf{x}$,
but not with respect to linear rescaling of the latent variables. We thus use:

\[ M_{(ij)(kl)} = W_{mj} W_{ml} \delta_{ik}. \tag{34.26} \]

Multiplying this matrix by the gradient in equation (34.15) we obtain the
following covariant learning algorithm:

\[ \Delta W_{ij} = \eta \left( W_{ij} + W_{i'j} a_{i'} z_i \right). \tag{34.27} \]


Notice that this expression does not require any inversion of the matrix $W$.
The only additional computation once $\mathbf{z}$ has been computed is a single backward
pass through the weights to compute the quantity

\[ x'_j = W_{i'j} a_{i'}, \tag{34.28} \]

in terms of which the covariant algorithm reads:

\[ \Delta W_{ij} = \eta \left( W_{ij} + x'_j z_i \right). \tag{34.29} \]

The quantity $(W_{ij} + x'_j z_i)$ on the right-hand side is sometimes called the
natural gradient. The covariant independent component analysis algorithm is
summarized in algorithm 34.4.


Algorithm 34.4. Independent component analysis (covariant version).

Repeat for each datapoint $\mathbf{x}$:
1. Put $\mathbf{x}$ through a linear mapping: $\mathbf{a} = W\mathbf{x}$.
2. Put $\mathbf{a}$ through a nonlinear map: $z_i = \phi_i(a_i)$, where a popular choice for $\phi$ is $\phi = -\tanh(a_i)$.
3. Put $\mathbf{a}$ back through $W$: $\mathbf{x}' = W^{\mathsf{T}} \mathbf{a}$.
4. Adjust the weights in accordance with $\Delta W \propto W + \mathbf{z}\mathbf{x}'^{\mathsf{T}}$.
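The four steps of the covariant algorithm translate almost line for line into code. The following is a minimal sketch, not a definitive implementation: the function names, the learning rate, the number of passes, the identity initialization, and the toy Laplacian sources and mixing matrix are all assumptions made for illustration.

```python
import numpy as np

def covariant_ica_step(W, x, eta):
    """One on-line update of the covariant rule (34.29)."""
    a = W @ x                      # step 1: linear mapping a = W x
    z = -np.tanh(a)                # step 2: nonlinear map z_i = phi(a_i)
    x_prime = W.T @ a              # step 3: backward pass x' = W^T a
    return W + eta * (W + np.outer(z, x_prime))   # step 4: Delta W = eta (W + z x'^T)

def covariant_ica(X, eta=0.002, n_passes=10, seed=0):
    """Sweep repeatedly over the rows of X (one datapoint per row)."""
    rng = np.random.default_rng(seed)
    W = np.eye(X.shape[1])         # assumed starting point
    for _ in range(n_passes):
        for n in rng.permutation(len(X)):
            W = covariant_ica_step(W, X[n], eta)
    return W

# Hypothetical toy data: two heavy-tailed sources mixed by an assumed matrix G.
rng = np.random.default_rng(0)
sources = rng.laplace(size=(500, 2))
G = np.array([[1.0, 0.5], [0.3, 1.0]])
X = sources @ G.T
W = covariant_ica(X)
print(W @ G)   # if separation succeeds, roughly a scaled permutation matrix
```

No matrix inverse appears anywhere. The update is also covariant with respect to linear rescaling of the data: replacing $\mathbf{x}$ by $A\mathbf{x}$ and $W$ by $WA^{-1}$ leaves $\mathbf{a}$ and $\mathbf{z}$ unchanged, so the updated weights are again related by the factor $A^{-1}$.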


Further reading 


ICA was originally derived using an information maximization approach (Bell 
and Sejnowski, 1995). Another view of ICA, in terms of energy functions, 
which motivates more general models, is given by Hinton et al. (2001). Another 
generalization of ICA can be found in Pearlmutter and Parra (1996, 1997). 
There is now an enormous literature on applications of ICA. A variational free 
energy minimization approach to ICA-like models is given in (Miskin, 2001; 
Miskin and MacKay, 2000; Miskin and MacKay, 2001). Further reading on 
blind separation, including non-ICA algorithms, can be found in (Jutten and 
Herault, 1991; Comon et al., 1991; Hendin et al., 1994; Amari et al., 1996; 
Hojen-Sorensen et al., 2002). 


Infinite models 


While latent variable models with a finite number of latent variables are widely 
used, it is often the case that our beliefs about the situation would be most 
accurately captured by a very large number of latent variables. 

Consider clustering, for example. If we attack speech recognition by mod- 
elling words using a cluster model, how many clusters should we use? The 
number of possible words is unbounded (section 18.2), so we would really like 
to use a model in which it’s always possible for new clusters to arise. 

Furthermore, if we do a careful job of modelling the cluster corresponding 
to just one English word, we will probably find that the cluster for one word 
should itself be modelled as composed of clusters – indeed, a hierarchy of
clusters within clusters. The first levels of the hierarchy would divide male
speakers from female, and would separate speakers from different regions — 
India, Britain, Europe, and so forth. Within each of those clusters would be 
subclusters for the different accents within each region. The subclusters could 
have subsubclusters right down to the level of villages, streets, or families. 

Thus we would often like to have infinite numbers of clusters; in some 
cases the clusters would have a hierarchical structure, and in other cases the 
hierarchy would be flat. So, how should such infinite models be implemented 
in finite computers? And how should we set up our Bayesian models so as to 
avoid getting silly answers? 

Infinite mixture models for categorical data are presented in Neal (1991), 
along with a Monte Carlo method for simulating inferences and predictions. 
Infinite Gaussian mixture models with a flat hierarchical structure are pre- 
sented in Rasmussen (2000). Neal (2001) shows how to use Dirichlet diffusion 
trees to define models of hierarchical clusters. Most of these ideas build on 
the Dirichlet process (section 18.2). This remains an active research area 
(Rasmussen and Ghahramani, 2002; Beal et al., 2002). 


> 34.4 Exercises 


Exercise 34.1.15] Repeat the derivation of the algorithm, but assume a small 
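amount of noise in $\mathbf{x}$: $\mathbf{x} = G\mathbf{s} + \mathbf{n}$; so the term $\delta\bigl(x_j^{(n)} - \sum_i G_{ji} s_i^{(n)}\bigr)$
in the joint probability (34.3) is replaced by a probability distribution
over $x_j^{(n)}$ with mean $\sum_i G_{ji} s_i^{(n)}$. Show that, if this noise distribution has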
sufficiently small standard deviation, the identical algorithm results. 


Exercise 34.2.15] Implement the covariant ICA algorithm and apply it to toy 
data. 


Exercise 34.3. 4-5] Create algorithms appropriate for the situations: (a) x in- 
cludes substantial Gaussian noise; (b) more measurements than latent 
variables (J > I); (c) fewer measurements than latent variables (J < I).


Factor analysis assumes that the observations x can be described in terms of 
independent latent variables $\{s_i\}$ and independent additive noise. Thus the
observable $\mathbf{x}$ is given by

\[ \mathbf{x} = G\mathbf{s} + \mathbf{n}, \tag{34.30} \]

where $\mathbf{n}$ is a noise vector whose components have a separable probability distribution.
In factor analysis it is often assumed that the probability distributions
of $\{s_i\}$ and $\{n_j\}$ are zero-mean Gaussians; the noise terms may have different
variances $\sigma_j^2$.


Exercise 34.4.4] Make a maximum likelihood algorithm for inferring G from 
data, assuming the generative model x = Gs + n is correct and that s 
and $\mathbf{n}$ have independent Gaussian distributions. Include parameters $\sigma_j^2$
to describe the variance of each $n_j$, and maximize the likelihood with
respect to them too. Let the variance of each $s_i$ be 1.


Exercise 34.5. 4C] Implement the infinite Gaussian mixture model of Rasmussen 


(2000).


> 35.1 What do you know if you are ignorant?


Example 35.1. A real variable $x$ is measured in an accurate experiment. For
example, $x$ might be the half-life of the neutron, the wavelength of light
emitted by a firefly, the depth of Lake Vostok, or the mass of Jupiter’s
moon Io.

What is the probability that the value of $x$ starts with a ‘1’, like the
charge of the electron (in S.I. units),

\[ e = 1.602\ldots \times 10^{-19}\,\mathrm{C}, \]

and the Boltzmann constant,

\[ k = 1.38066\ldots \times 10^{-23}\,\mathrm{J\,K^{-1}}? \]

And what is the probability that it starts with a ‘9’, like the Faraday
constant,

\[ F = 9.648\ldots \times 10^{4}\,\mathrm{C\,mol^{-1}}? \]

What about the second digit? What is the probability that the mantissa
of $x$ starts ‘1.1...’, and what is the probability that $x$ starts ‘9.9...’?

Solution. An expert on neutrons, fireflies, Antarctica, or Jove might be able to
predict the value of $x$, and thus predict the first digit with some confidence, but
what about someone with no knowledge of the topic? What is the probability
distribution corresponding to ‘knowing nothing’?

One way to attack this question is to notice that the units of $x$ have not
been specified. If the half-life of the neutron were measured in fortnights
instead of seconds, the number $x$ would be divided by 1209600; if it were
measured in years, it would be divided by $3 \times 10^7$. Now, is our knowledge
about $x$, and, in particular, our knowledge of its first digit, affected by the
change in units? For the expert, the answer is yes; but let us take someone
truly ignorant, for whom the answer is no; their predictions about the first digit
of $x$ are independent of the units. The arbitrariness of the units corresponds to
invariance of the probability distribution when $x$ is multiplied by any number.

Figure 35.1. When viewed on a logarithmic scale, scales using different units
(metres, feet, inches) are translated relative to each other.

If you don’t know the units that a quantity is measured in, the probability 
of the first digit must be proportional to the length of the corresponding piece 
of logarithmic scale. The probability that the first digit of a number is 1 is 
thus 

\[ p_1 = \frac{\log 2 - \log 1}{\log 10 - \log 1} = \frac{\log 2}{\log 10}. \tag{35.1} \]

Now, $2^{10} = 1024 \simeq 10^3 = 1000$, so without needing a calculator, we have
$10 \log 2 \simeq 3 \log 10$ and

\[ p_1 \simeq \frac{3}{10}. \tag{35.2} \]

More generally, the probability that the first digit is $d$ is

\[ (\log(d+1) - \log(d))/(\log 10 - \log 1) = \log_{10}(1 + 1/d). \tag{35.3} \]

This observation about initial digits is known as Benford’s law. Ignorance
does not correspond to a uniform probability distribution over $d$.

[Margin figure: a logarithmic scale from 1 to 10 with the intervals P(1), P(3), and P(9) marked.]
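Equation (35.3), and the question about the second digit, can be evaluated directly. The following is a small sketch; the function names are illustrative, and the second function simply applies the same "length of logarithmic scale" argument to an arbitrary interval of the mantissa.

```python
import math

def first_digit_prob(d):
    """Benford probability (35.3) that the leading digit is d, for d = 1..9."""
    return math.log10(1 + 1 / d)

def mantissa_interval_prob(lo, hi):
    """Probability that the mantissa lies in [lo, hi), with 1 <= lo < hi <= 10."""
    return math.log10(hi / lo)

for d in range(1, 10):
    print(d, round(first_digit_prob(d), 4))

print("starts 1.1:", mantissa_interval_prob(1.1, 1.2))
print("starts 9.9:", mantissa_interval_prob(9.9, 10.0))
```

The nine first-digit probabilities sum to one, as they must, since the nine intervals tile the logarithmic scale from 1 to 10.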
> Exercise 35.2.1] A pin is thrown tumbling in the air. What is the probability 
distribution of the angle $\theta_1$ between the pin and the vertical at a moment
while it is in the air? The tumbling pin is photographed. What is the
probability distribution of the angle $\theta_2$ between the pin and the vertical
as imaged in the photograph?


> Exercise 35.3.7] Record breaking. Consider keeping track of the world record 
for some quantity $x$, say earthquake magnitude, or long-jump distances
jumped at world championships. If we assume that attempts to break
the record take place at a steady rate, and if we assume that the underlying
probability distribution of the outcome $x$, $P(x)$, is not changing
(an assumption that I think is unlikely to be true in the case of sports
endeavours, but an interesting assumption to consider nonetheless), and
assuming no knowledge at all about $P(x)$, what can be predicted about
successive intervals between the dates when records are broken?


> 35.2 The Luria-Delbrück distribution


Exercise 35.4.90 P-449] In their landmark paper demonstrating that bacteria 
could mutate from virus sensitivity to virus resistance, Luria and Delbrück 
(1943) wanted to estimate the mutation rate in an exponentially-growing pop- 
ulation from the total number of mutants found at the end of the experi- 
ment. This problem is difficult because the quantity measured (the number 
of mutated bacteria) has a heavy-tailed probability distribution: a mutation 
occurring early in the experiment can give rise to a huge number of mutants.
Unfortunately, Luria and Delbrück didn’t know Bayes’ theorem, and their way 
of coping with the heavy-tailed distribution involves arbitrary hacks leading to 
two different estimators of the mutation rate. One of these estimators (based 
on the mean number of mutated bacteria, averaging over several experiments) 
has appallingly large variance, yet sampling theorists continue to use it and 
base confidence intervals around it (Kepler and Oprea, 2001). In this exercise 
you'll do the inference right. 

In each culture, a single bacterium that is not resistant gives rise, after $g$
generations, to $N = 2^g$ descendants, all clones except for differences arising
from mutations. The final culture is then exposed to a virus, and the number
of resistant bacteria $n$ is measured. According to the now accepted mutation
hypothesis, these resistant bacteria got their resistance from random mutations
that took place during the growth of the colony. The mutation rate (per cell
per generation), $a$, is about one in a hundred million. The total number of
opportunities to mutate is $N$, since $\sum_{i=0}^{g-1} 2^i \simeq 2^g = N$. If a bacterium mutates
at the $i$th generation, its descendants all inherit the mutation, and the final
number of resistant bacteria contributed by that one ancestor is $2^{g-i}$.

Given $M$ separate experiments, in each of which a colony of size $N$ is
created, and where the measured numbers of resistant bacteria are $\{n_m\}_{m=1}^{M}$,
what can we infer about the mutation rate, $a$?

Make the inference given the following dataset from Luria and Delbrück,
for $N = 2.4 \times 10^8$: $\{n_m\} = \{1, 0, 3, 0, 0, 5, 0, 5, 0, 6, 107, 0, 0, 0, 1, 0, 0, 64, 0, 35\}$.
[A small amount of computation is required to solve this problem.]


> 35.3 Inferring causation 

Exercise 35.5.1% P450] In the Bayesian graphical model community, the task 

of inferring which way the arrows point (that is, which nodes are parents,
and which children) is one on which much has been written.

Inferring causation is tricky because of ‘likelihood equivalence’. Two graph- 
ical models are likelihood-equivalent if for any setting of the parameters of 
either, there exists a setting of the parameters of the other such that the two 
joint probability distributions of all observables are identical. An example of 
a pair of likelihood-equivalent models is $A \to B$ and $B \to A$. The model
$A \to B$ asserts that A is the parent of B, or, in very sloppy terminology, ‘A
causes B’. An example of a situation where ‘$B \to A$’ is true is the case where
B is the variable ‘burglar in house’ and A is the variable ‘alarm is ringing’.
Here it is literally true that B causes A. But this choice of words is confusing if
applied to another example, $R \to D$, where R denotes ‘it rained this morning’
and D denotes ‘the pavement is dry’. ‘R causes D’ is confusing. I’ll therefore
use the words ‘B is a parent of A’ to denote causation. Some statistical methods
that use the likelihood alone are unable to use data to distinguish between
likelihood-equivalent models. In a Bayesian approach, on the other hand, two 
likelihood-equivalent models may nevertheless be somewhat distinguished, in 
the light of data, since likelihood-equivalence does not force a Bayesian to use 
priors that assign equivalent densities over the two parameter spaces of the 
models. 

However, many Bayesian graphical modelling folks, perhaps out of sym- 
pathy for their non-Bayesian colleagues, or from a latent urge not to appear 
different from them, deliberately discard this potential advantage of Bayesian 
methods — the ability to infer causation from data — by skewing their models 
so that the ability goes away; a widespread orthodoxy holds that one should 
identify the choices of prior for which ‘prior equivalence’ holds, i.e., the priors 
such that models that are likelihood-equivalent also have identical posterior 
probabilities; and then one should use one of those priors in inference and 
prediction. This argument motivates the use, as the prior over all probability 
vectors, of specially-constructed Dirichlet distributions. 

In my view it is a philosophical error to use only those priors such that 
causation cannot be inferred. Priors should be set to describe one’s assump- 
tions; when this is done, it’s likely that interesting inferences about causation 
can be made from data. 

In this exercise, you’ll make an example of such an inference. 

Consider the toy problem where A and B are binary variables. The two 
models are $\mathcal{H}_{A \to B}$ and $\mathcal{H}_{B \to A}$. $\mathcal{H}_{A \to B}$ asserts that the marginal probability
of A comes from a beta distribution with parameters (1,1), i.e., the uniform
distribution; and that the two conditional distributions $P(b \mid a=0)$ and
$P(b \mid a=1)$ also come independently from beta distributions with parameters
(1,1). The other model assigns similar priors to the marginal probability of
B and the conditional distributions of A given B. Data are gathered, and the
counts, given $F = 1000$ outcomes, are

            a=0    a=1
    b=0     760      5  |  765
    b=1     190     45  |  235
            950     50
                                        (35.4)


What are the posterior probabilities of the two hypotheses? 


Hint: it’s a good idea to work this exercise out symbolically in order to spot 
all the simplifications that emerge. 


\[ \Psi(x) \equiv \frac{d}{dx} \ln \Gamma(x) \simeq \ln(x) - \frac{1}{2x} + O(1/x^2). \tag{35.5} \]

The topic of inferring causation is a complex one. The fact that Bayesian 
inference can sensibly be used to infer the directions of arrows in graphs seems 
to be a neglected view, but it is certainly not the whole story. See Pearl (2000) 
for discussion of many other aspects of causality. 


> 35.4 Further exercises 


Exercise 35.6.19] Photons arriving at a photon detector are believed to be emit- 
ted as a Poisson process with a time-varying rate, 


\[ \lambda(t) = \exp(a + b \sin(\omega t + \phi)), \tag{35.6} \]


where the parameters $a$, $b$, $\omega$, and $\phi$ are known. Data are collected during
the time $t = 0 \ldots T$. Given that $N$ photons arrived at times $\{t_n\}_{n=1}^{N}$,
discuss the inference of $a$, $b$, $\omega$, and $\phi$. [Further reading: Gregory and
Loredo (1992).]


> Exercise 35.7.[2] A data file consisting of two columns of numbers has been
printed in such a way that the boundaries between the columns are
unclear. Here are the resulting strings.


891.10.0 | 912.20.0 | 874.10.0 | 870.20.0 | 836.10.0 | 861.20.0
903.10.0 | 937.10.0 | 850.20.0 | 916.20.0 | 899.10.0 | 907.10.0
924.20.0 | 861.10.0 | 899.20.0 | 849.10.0 | 887.20.0 | 840.10.0
849.20.0 | 891.10.0 | 916.20.0 | 891.10.0 | 912.20.0 | 875.10.0
898.20.0 | 924.10.0 | 950.20.0 | 958.10.0 | 971.20.0 | 933.10.0
966.20.0 | 908.10.0 | 924.20.0 | 983.10.0 | 924.20.0 | 908.10.0
950.20.0 | 911.10.0 | 913.20.0 | 921.25.0 | 912.20.0 | 917.30.0
923.50.0


Discuss how probable it is, given these data, that the correct parsing of
each item is:

(a) 891.10.0 → 891. 10.0, etc.
(b) 891.10.0 → 891.1 0.0, etc.


[A parsing of a string is a grammatical interpretation of the string. For
example, ‘Punch bores’ could be parsed as ‘Punch (noun) bores (verb)’,
or ‘Punch (imperative verb) bores (plural noun)’.]


> Exercise 35.8.[2] In an experiment, the measured quantities $\{x_n\}$ come independently
from a biexponential distribution with mean $\mu$,

\[ P(x \mid \mu) = \frac{1}{Z} \exp(-|x - \mu|), \]

where $Z$ is the normalizing constant, $Z = 2$. The mean $\mu$ is not known.
An example of this distribution, with $\mu = 1$, is shown in figure 35.2.

Assuming the four datapoints are

\[ \{x_n\} = \{0, 0.9, 2, 6\}, \]

what do these data tell us about $\mu$? Include detailed sketches in your
answer. Give a range of plausible values of $\mu$.





> 35.5 Solutions 


Figure 35.2. The biexponential distribution $P(x \mid \mu = 1)$.


Solution to exercise 35.4 (p.446). A population of size $N$ has $N$ opportunities
to mutate. The probability of the number of mutations that occurred, $r$, is
roughly Poisson

\[ P(r \mid a, N) = e^{-aN} \frac{(aN)^r}{r!}. \tag{35.7} \]

(This is slightly inaccurate because the descendants of a mutant cannot themselves
undergo the same mutation.) Each mutation gives rise to a number of
final mutant cells $n_i$ that depends on the generation time of the mutation. If
multiplication went like clockwork then the probability of $n_i$ being 1 would
be 1/2, the probability of 2 would be 1/4, the probability of 4 would be 1/8,
and $P(n_i) = 1/(2n_i)$ for all $n_i$ that are powers of two. But we don’t expect
the mutant progeny to divide in exact synchrony, and we don’t know the precise
timing of the end of the experiment compared to the division times. A
smoothed version of this distribution that permits all integers to occur is

\[ P(n_i) = \frac{1}{Z} \frac{1}{n_i^2}, \tag{35.8} \]

where $Z = \pi^2/6 = 1.645$. [This distribution’s moments are all wrong, since
$n_i$ can never exceed $N$, but who cares about moments? Only sampling
theory statisticians who are barking up the wrong tree, constructing ‘unbiased
estimators’ such as $\hat{a} = (\bar{n}/N)/\log N$. The error that we introduce in the
likelihood function by using the approximation to $P(n_i)$ is negligible.]

The observed number of mutants $n$ is the sum

\[ n = \sum_{i=1}^{r} n_i. \tag{35.9} \]

The probability distribution of $n$ given $r$ is the convolution of $r$ identical
distributions of the form (35.8). For example,

\[ P(n \mid r = 2) = \sum_{n_1=1}^{n-1} \frac{1}{Z^2} \frac{1}{n_1^2} \frac{1}{(n - n_1)^2} \quad \text{for } n \geq 2. \tag{35.10} \]

The probability distribution of $n$ given $a$, which is what we need for the
Bayesian inference, is given by summing over $r$.

\[ P(n \mid a) = \sum_{r=0}^{N} P(n \mid r) P(r \mid a, N). \tag{35.11} \]


This quantity can’t be evaluated analytically, but for small $a$, it’s easy to
evaluate to any desired numerical precision by explicitly summing over $r$ from
$r = 0$ to some $r_{\max}$, with $P(n \mid r)$ also being found for each $r$ by $r_{\max}$ explicit
convolutions for all required values of $n$; if $r_{\max} = n_{\max}$, the largest value
of $n$ encountered in the data, then $P(n \mid a)$ is computed exactly; but for this
question’s data, $r_{\max} = 9$ is plenty for an accurate result; I used $r_{\max} = 74$
to make the graphs in figure 35.3. Octave source code is available.¹

Incidentally, for data sets like the one in this exercise, which have a substantial
number of zero counts, very little is lost by making Luria and Delbrück’s second
approximation, which is to retain only the count of how many $n$ were equal to
zero, and how many were non-zero. The likelihood function found using this
weakened data set,

\[ L(a) = (e^{-aN})^{11} (1 - e^{-aN})^{9}, \tag{35.12} \]

is scarcely distinguishable from the likelihood computed using full information.
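The numerical recipe just described is short enough to sketch in full. The following Python sketch is an illustration under stated assumptions, not a transcription of the Octave code referred to above: the function names, the grid of trial values of $a$, and the default $r_{\max} = 9$ are choices made here. It builds $P(n \mid r)$ by repeated convolution of (35.8), sums over $r$ as in (35.11), and also evaluates the weakened likelihood (35.12).

```python
import math
import numpy as np

DATA = [1, 0, 3, 0, 0, 5, 0, 5, 0, 6, 107, 0, 0, 0, 1, 0, 0, 64, 0, 35]
N = 2.4e8

def single_mutation_dist(n_max):
    """P(n_i) = 1/(Z n_i^2) for n_i = 1..n_max, with Z = pi^2/6; entry 0 is zero."""
    p = np.zeros(n_max + 1)
    k = np.arange(1, n_max + 1)
    p[1:] = 1.0 / (np.pi**2 / 6 * k**2)
    return p

def prob_n_given_r(n_max, r_max):
    """Row r holds P(n | r) for n = 0..n_max, built by repeated convolution."""
    p1 = single_mutation_dist(n_max)
    table = np.zeros((r_max + 1, n_max + 1))
    table[0, 0] = 1.0                      # no mutations means no mutants
    for r in range(1, r_max + 1):
        table[r] = np.convolve(table[r - 1], p1)[: n_max + 1]
    return table

def log_likelihood(a, data, N, r_max=9):
    """Sum over experiments of log P(n | a), using (35.7) and (35.11)."""
    table = prob_n_given_r(max(data), r_max)
    r = np.arange(r_max + 1)
    log_fact = np.array([math.lgamma(k + 1) for k in r])
    poisson = np.exp(-a * N + r * np.log(a * N) - log_fact)
    p_n = poisson @ table                  # P(n | a) for n = 0..n_max
    return float(np.sum(np.log(p_n[list(data)])))

def log_likelihood_zero_nonzero(a, data, N):
    """Weakened likelihood (35.12): keep only zero versus non-zero counts."""
    zeros = sum(1 for n in data if n == 0)
    return -a * N * zeros + (len(data) - zeros) * math.log(1 - math.exp(-a * N))

a_grid = np.logspace(-10, -7, 61)
full = [log_likelihood(a, DATA, N) for a in a_grid]
weak = [log_likelihood_zero_nonzero(a, DATA, N) for a in a_grid]
a_hat_full = a_grid[int(np.argmax(full))]
a_hat_weak = a_grid[int(np.argmax(weak))]
print("most probable a (full data):    ", a_hat_full)
print("most probable a (zero/non-zero):", a_hat_weak)
```

Plotting `full` and `weak` against `a_grid`, after subtracting each curve's maximum, gives a direct way to compare the two likelihood functions.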


Figure 35.3. Likelihood of the mutation rate $a$ on a linear scale and log scale,
given Luria and Delbrück’s data. Vertical axis: likelihood$/10^{-23}$; horizontal axis: $a$.


Solution to exercise 35.5 (p.447). From the six terms of the form

\[ P(\mathbf{F} \mid \alpha\mathbf{m}) = \frac{\prod_i \Gamma(F_i + \alpha m_i)}{\Gamma\left(\sum_i F_i + \alpha\right)} \frac{\Gamma(\alpha)}{\prod_i \Gamma(\alpha m_i)}, \tag{35.13} \]

most factors cancel and all that remains is

\[ \frac{P(\mathcal{H}_{A \to B} \mid \text{Data})}{P(\mathcal{H}_{B \to A} \mid \text{Data})} = \frac{(765 + 1)(235 + 1)}{(950 + 1)(50 + 1)} \simeq \frac{3.7}{1}. \tag{35.14} \]

There is modest evidence in favour of $\mathcal{H}_{A \to B}$ because the three probabilities
inferred for that hypothesis (roughly 0.95, 0.8, and 0.1) are more typical of
the prior than are the three probabilities inferred for the other (0.24, 0.008,
and 0.19). This statement sounds absurd if we think of the priors as ‘uniform’
over the three probabilities: surely, under a uniform prior, any settings of the
probabilities are equally probable? But in the natural basis, the logit basis,
the prior is proportional to $p(1 - p)$, and the posterior probability ratio can
be estimated by

\[ \frac{0.95 \times 0.05 \times 0.8 \times 0.2 \times 0.1 \times 0.9}{0.24 \times 0.76 \times 0.008 \times 0.992 \times 0.19 \times 0.81} \simeq \frac{3}{1}, \tag{35.15} \]

which is not exactly right, but it does illustrate where the preference for $A \to B$
is coming from.
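The cancellation can be confirmed by brute force. The sketch below assumes equal prior probabilities for the two hypotheses, so that the posterior ratio equals the evidence ratio, and uses the fact that under a uniform Beta(1,1) prior a binary variable with counts $F_0$ and $F_1$ has evidence $F_0!\,F_1!/(F_0 + F_1 + 1)!$. The function names and the `(a, b)` keying of the counts are illustrative.

```python
from math import exp, lgamma

def log_evidence_binary(f0, f1):
    """log of F0! F1! / (F0 + F1 + 1)!, the evidence under a Beta(1,1) prior."""
    return lgamma(f0 + 1) + lgamma(f1 + 1) - lgamma(f0 + f1 + 2)

# Counts keyed by (a, b), from table (35.4).
COUNTS = {(0, 0): 760, (1, 0): 5, (0, 1): 190, (1, 1): 45}

def log_evidence(counts, parent_is_a):
    """Log evidence for 'A is the parent of B' (True) or 'B is the parent of A' (False)."""
    def n(parent_val, child_val):
        key = (parent_val, child_val) if parent_is_a else (child_val, parent_val)
        return counts[key]
    # One term for the parent's marginal, one for each conditional of the child.
    total = log_evidence_binary(n(0, 0) + n(0, 1), n(1, 0) + n(1, 1))
    for v in (0, 1):
        total += log_evidence_binary(n(v, 0), n(v, 1))
    return total

ratio = exp(log_evidence(COUNTS, True) - log_evidence(COUNTS, False))
print("evidence ratio A->B : B->A =", ratio)
print("posterior probability of A->B =", ratio / (1 + ratio))
```

The computed ratio agrees with the closed form $(766 \times 236)/(951 \times 51)$ obtained after cancellation.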


¹ www.inference.phy.cam.ac.uk/itprnn/code/octave/luriaO.m


Decision Theory


Decision theory is trivial, apart from computational details (just like playing 
chess!). 

You have a choice of various actions, a. The world may be in one of many 
states x; which one occurs may be influenced by your action. The world’s 
state has a probability distribution P(x |a). Finally, there is a utility function 
U(x, a) which specifies the payoff you receive when the world is in state x and 
you chose action a. 

The task of decision theory is to select the action that maximizes the 
expected utility, 


\[ E[U \mid a] = \int d^K x \; U(x, a) P(x \mid a). \tag{36.1} \]


That’s all. The computational problem is to maximize E[U | a] over a. [Pes- 
simists may prefer to define a loss function L instead of a utility function U 
and minimize the expected loss.] 
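When the world has only a handful of states, the integral in (36.1) becomes a sum and the whole of decision theory fits in a few lines. The following is a sketch with made-up numbers: the two actions, the two states, the probabilities and the utilities are all hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def best_action(P, U):
    """P[a, x] = P(x | a) and U[a, x] = U(x, a); returns (best a, E[U | a] for every a)."""
    expected = np.sum(P * U, axis=1)
    return int(np.argmax(expected)), expected

# Hypothetical example: actions 0 = take umbrella, 1 = leave it at home;
# states 0 = rain, 1 = dry. Here the action does not influence the weather.
P = np.array([[0.3, 0.7],
              [0.3, 0.7]])
U = np.array([[0.0, -1.0],     # umbrella: fine in rain, a nuisance when dry
              [-10.0, 0.0]])   # no umbrella: soaked in rain, fine when dry

action, expected = best_action(P, U)
print("expected utilities:", expected)
print("best action:", action)
```

A pessimist would pass the negated utilities as a loss and take the `argmin` instead; the chosen action is the same.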

Is there anything more to be said about decision theory? 

Well, in a real problem, the choice of an appropriate utility function may 
be quite difficult. Furthermore, when a sequence of actions is to be taken, 
with each action providing information about x, we have to take into account 
the effect that this anticipated information may have on our subsequent ac- 
tions. The resulting mixture of forward probability and inverse probability 
computations in a decision problem is distinctive. In a realistic problem such 
as playing a board game, the tree of possible cogitations and actions that must 
be considered becomes enormous, and ‘doing the right thing’ is not simple, 
because the expected utility of an action cannot be computed exactly (Russell 
and Wefald, 1991; Baum and Smith, 1993; Baum and Smith, 1997). 

Let’s explore an example. 


> 36.1 Rational prospecting 


Suppose you have the task of choosing the site for a Tanzanite mine. Your 
final action will be to select the site from a list of $N$ sites. The $n$th site has
a net value called the return $x_n$ which is initially unknown, and will be found
out exactly only after site $n$ has been chosen. [$x_n$ equals the revenue earned
from selling the Tanzanite from that site, minus the costs of buying the site,
paying the staff, and so forth.] At the outset, the return $x_n$ has a probability
distribution $P(x_n)$, based on the information already available.

Before you take your final action you have the opportunity to do some
prospecting. Prospecting at the $n$th site has a cost $c_n$ and yields data $d_n$
which reduce the uncertainty about $x_n$. [We’ll assume that the returns of
the $N$ sites are unrelated to each other, and that prospecting at one site only
yields information about that site and doesn’t affect the return from that site.]
Your decision problem is: 


given the initial probability distributions $P(x_1), P(x_2), \ldots, P(x_N)$,
first, decide whether to prospect, and at which sites; then, in the 
light of your prospecting results, choose which site to mine. 


For simplicity, let’s make everything in the problem Gaussian and focus
on the question of whether to prospect once or not. We’ll assume our utility
function is linear in $x_n$; we wish to maximize our expected return. The utility
function is

\[ U = x_{n_a}, \tag{36.2} \]

if no prospecting is done, where $n_a$ is the chosen ‘action’ site; and, if prospecting
is done, the utility is

\[ U = -c_{n_p} + x_{n_a}, \tag{36.3} \]

where $n_p$ is the site at which prospecting took place.

The prior distribution of the return of site $n$ is

\[ P(x_n) = \text{Normal}(x_n; \mu_n, \sigma_n^2). \tag{36.4} \]

[The notation $P(y) = \text{Normal}(y; \mu, \sigma^2)$ indicates that $y$ has Gaussian distribution
with mean $\mu$ and variance $\sigma^2$.]

If you prospect at site $n$, the datum $d_n$ is a noisy version of $x_n$:

\[ P(d_n \mid x_n) = \text{Normal}(d_n; x_n, \sigma^2). \tag{36.5} \]


> Exercise 36.1.[2] Given these assumptions, show that the prior probability distribution
of $d_n$ is

\[ P(d_n) = \text{Normal}(d_n; \mu_n, \sigma^2 + \sigma_n^2) \tag{36.6} \]

(mnemonic: when independent variables add, variances add), and that
the posterior distribution of $x_n$ given $d_n$ is

\[ P(x_n \mid d_n) = \text{Normal}\left(x_n; \mu'_n, {\sigma_n^2}'\right) \tag{36.7} \]

where

\[ \mu'_n = \frac{d_n/\sigma^2 + \mu_n/\sigma_n^2}{1/\sigma^2 + 1/\sigma_n^2} \quad \text{and} \quad \frac{1}{{\sigma_n^2}'} = \frac{1}{\sigma^2} + \frac{1}{\sigma_n^2} \tag{36.8} \]

(mnemonic: when Gaussians multiply, precisions add).


To start with, let’s evaluate the expected utility if we do no prospecting (i.e., 
choose the site immediately); then we’ll evaluate the expected utility if we first 
prospect at one site and then make our choice. From these two results we will 
be able to decide whether to prospect once or zero times, and, if we prospect 
once, at which site. 

So, first we consider the expected utility without any prospecting. 


> Exercise 36.2.[2] Show that the optimal action, assuming no prospecting, is to
select the site with biggest mean

\[ n_a = \operatorname*{argmax}_n \mu_n, \tag{36.9} \]

and the expected utility of this action is

\[ E[U \mid \text{optimal } n] = \max_n \mu_n. \tag{36.10} \]
[If your intuition says ‘surely the optimal decision should take into ac- 


count the different uncertainties op too?’, the answer to this question is 
‘reasonable — if so, then the utility function should be nonlinear in 2’.] 
Now the exciting bit. Should we prospect? Once we have prospected at
site $n_p$, we will choose the site using the decision rule (36.9) with the value of
mean $\mu_{n_p}$ replaced by the updated value $\mu'_n$ given by (36.8). What makes the
problem exciting is that we don’t yet know the value of $d_n$, so we don’t know
what our action $n_a$ will be; indeed the whole value of doing the prospecting
comes from the fact that the outcome $d_n$ may alter the action from the one
that we would have taken in the absence of the experimental information.

From the expression for the new mean in terms of $d_n$ (36.8), and the known
variance of $d_n$ (36.6), we can compute the probability distribution of the key
quantity, $\mu'_n$, and can work out the expected utility by integrating over all
possible outcomes and their associated actions.


Exercise 36.3.[2] Show that the probability distribution of the new mean $\mu'_n$
(36.8) is Gaussian with mean $\mu_n$ and variance
\[ s^2 \equiv \sigma_n^2 \, \frac{\sigma_n^2}{\sigma^2 + \sigma_n^2}. \tag{36.11} \]


Consider prospecting at site $n$. Let the biggest mean of the other sites be
$\mu_1$. When we obtain the new value of the mean, $\mu'_n$, we will choose site $n$ and
get an expected return of $\mu'_n$ if $\mu'_n > \mu_1$, and we will choose site 1 and get an
expected return of $\mu_1$ if $\mu'_n < \mu_1$.

So the expected utility of prospecting at site $n$, then picking the best site,
is
\[ E[U \mid \text{prospect at } n] = -c_n + P(\mu'_n < \mu_1)\, \mu_1 + \int_{\mu_1}^{\infty} d\mu'_n \, \mu'_n \, \text{Normal}(\mu'_n; \mu_n, s^2). \tag{36.12} \]
The difference in utility between prospecting and not prospecting is the
quantity of interest, and it depends on what we would have done without
prospecting; and that depends on whether $\mu_1$ is bigger than $\mu_n$.
\[ E[U \mid \text{no prospecting}] = \begin{cases} \mu_1 & \text{if } \mu_1 \ge \mu_n \\ \mu_n & \text{if } \mu_1 \le \mu_n. \end{cases} \tag{36.13} \]
So
\[ E[U \mid \text{prospect at } n] - E[U \mid \text{no prospecting}] = \begin{cases} -c_n + \int_{\mu_1}^{\infty} d\mu'_n \, (\mu'_n - \mu_1) \, \text{Normal}(\mu'_n; \mu_n, s^2) & \text{if } \mu_1 \ge \mu_n \\ -c_n + \int_{-\infty}^{\mu_1} d\mu'_n \, (\mu_1 - \mu'_n) \, \text{Normal}(\mu'_n; \mu_n, s^2) & \text{if } \mu_1 \le \mu_n. \end{cases} \tag{36.14} \]

We can plot the change in expected utility due to prospecting (omitting
$c_n$) as a function of the difference $(\mu_n - \mu_1)$ (horizontal axis) and the initial
standard deviation $\sigma_n$ (vertical axis). In the figure the noise variance is $\sigma^2 = 1$.

Figure 36.1. Contour plot of the gain in expected utility due to
prospecting. The contours are equally spaced from 0.1 to 1.2 in
steps of 0.1. To decide whether it is worth prospecting at site $n$,
find the contour equal to $c_n$ (the cost of prospecting); all points
$[(\mu_n - \mu_1), \sigma_n]$ above that contour are worthwhile.
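The integrals in (36.14) have a closed form in terms of the standard normal density and cumulative distribution, so the quantity plotted in figure 36.1 can be evaluated directly. The sketch below is one way to do so; the function names are illustrative, and the closed form is the standard result for the expectation of the positive part of a Gaussian, not an expression taken from the text.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(z):
    """Standard normal density."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def prospecting_gain(delta, sigma_n, sigma=1.0):
    """Gain in expected utility from prospecting at site n, omitting the cost c_n.

    delta   = mu_n - mu_1 (difference between site n and the best other site)
    sigma_n = prior standard deviation of site n
    sigma   = measurement noise standard deviation
    """
    # Standard deviation s of the new mean, from equation (36.11).
    s = sigma_n**2 / sqrt(sigma**2 + sigma_n**2)
    if s == 0.0:
        return 0.0  # a measurement cannot change the decision
    z = abs(delta) / s
    # Both branches of (36.14) reduce to the same symmetric expression.
    return s * normal_pdf(z) - abs(delta) * normal_cdf(-z)


if __name__ == "__main__":
    for delta in (-2.0, 0.0, 2.0):
        print(delta, prospecting_gain(delta, sigma_n=2.0))
```

Prospecting at site $n$ is worthwhile, under these assumptions, when the returned gain exceeds $c_n$. The gain is symmetric in the sign of the difference and is largest when the two means are equal.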


> 36.2 Further reading 


If the world in which we act is a little more complicated than the prospecting 
problem — for example, if multiple iterations of prospecting are possible, and 
the cost of prospecting is uncertain — then finding the optimal balance between 
exploration and exploitation becomes a much harder computational problem. 
Reinforcement learning addresses approximate methods for this problem (Sut- 
ton and Barto, 1998). 
> 36.3 Further exercises


Exercise 36.4.[2] The four doors problem.


A new game show uses rules similar to those of the three doors (exer- 
cise 3.8 (p.57)), but there are four doors, and the host explains: ‘First 
you will point to one of the doors, and then I will open one of the other 
doors, guaranteeing to choose a non-winner. Then you decide whether 
to stick with your original pick or switch to one of the remaining doors. 
Then I will open another non-winner (but never the current pick). You 
will then make your final decision by sticking with the door picked on 
the previous decision or by switching to the only other remaining door.’ 


What is the optimal strategy? Should you switch on the first opportu- 
nity? Should you switch on the second opportunity? 


Exercise 36.5.[3] One of the challenges of decision theory is figuring out exactly
what the utility function is. The utility of money, for example, is
notoriously nonlinear for most people. 


In fact, the behaviour of many people cannot be captured by a coher- 
ent utility function, as illustrated by the Allais paradox, which runs as 
follows. 


Which of these choices do you find most attractive? 
A. £1 million guaranteed. 
B. 89% chance of £1 million; 
10% chance of £2.5 million; 
1% chance of nothing. 
Now consider these choices: 
C. 89% chance of nothing; 
11% chance of £1 million. 
D. 90% chance of nothing; 
10% chance of £2.5 million. 


Many people prefer A to B, and, at the same time, D to C. Prove 
that these preferences are inconsistent with any utility function U(x) 
for money. 


Exercise 36.6.[4] Optimal stopping.


A large queue of $N$ potential partners is waiting at your door, all asking
to marry you. They have arrived in random order. As you meet each
partner, you have to decide on the spot, based on the information so
far, whether to marry them or say no. Each potential partner has a
desirability $d_n$, which you find out if and when you meet them. You
must marry one of them, but you are not allowed to go back to anyone
you have said no to.


There are several ways to define the precise problem.


(a) Assuming your aim is to maximize the desirability $d_n$, i.e., your
utility function is $d_n$, where $n$ is the partner selected, what strategy
should you use?


(b) Assuming you wish very much to marry the most desirable person 
(i.e., your utility function is 1 if you achieve that, and zero other- 
wise); what strategy should you use? 
(c) Assuming you wish very much to marry the most desirable person,
and that your strategy will be ‘strategy M’:


Strategy M: Meet the first $M$ partners and say no to all
of them. Memorize the maximum desirability $d_{\max}$ among
them. Then meet the others in sequence, waiting until a
partner with $d_n > d_{\max}$ comes along, and marry them.
If none more desirable comes along, marry the final $N$th
partner (and feel miserable).


— what is the optimal value of M? 
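For small $N$, strategy M can be checked by brute force. The sketch below enumerates every arrival order and counts how often the strategy ends with the most desirable partner. The function names are illustrative, and desirabilities are assumed to be distinct, represented by the ranks 0 to $N-1$.

```python
from fractions import Fraction
from itertools import permutations


def strategy_m_picks_best(order, m):
    """Apply strategy M to one arrival order; return True if the best is married."""
    best = max(order)
    # Maximum desirability among the first m partners (all rejected).
    d_max = max(order[:m], default=float("-inf"))
    # Meet the others in sequence, except the last, who is the fallback.
    for d in order[m:-1]:
        if d > d_max:
            return d == best
    # Either the final partner beats d_max or nobody did: marry the final one.
    return order[-1] == best


def success_probability(n, m):
    """Exact probability that strategy M marries the most desirable of n partners."""
    orders = list(permutations(range(n)))
    wins = sum(strategy_m_picks_best(order, m) for order in orders)
    return Fraction(wins, len(orders))


if __name__ == "__main__":
    n = 6
    for m in range(n):
        print(m, success_probability(n, m))
```

Enumeration costs $N!$ evaluations, so this is only practical for small queues; it is meant as a way to test a conjectured formula rather than as a solution method.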


Exercise 36.7.[3] Regret as an objective function?
The preceding exercise (parts b and c) involved a utility function based
on regret. If one married the tenth most desirable candidate, the utility
function asserts that one would feel regret for having not chosen the
most desirable.


Many people working in learning theory and decision theory use ‘minimizing
the maximal possible regret’ as an objective function, but does
this make sense?


Imagine that Fred has bought a lottery ticket, and offers to sell it to you
before it’s known whether the ticket is a winner. For simplicity say the
probability that the ticket is a winner is 1/100, and if it is a winner, it
is worth £10. Fred offers to sell you the ticket for £1. Do you buy it?
The possible actions are ‘buy’ and ‘don’t buy’. The utilities of the four
possible action-outcome pairs are shown in table 36.2. I have assumed
that the utility of small amounts of money for you is linear. If you don’t
buy the ticket then the utility is zero regardless of whether the ticket
proves to be a winner. If you do buy the ticket you end up either losing
one pound (with probability 99/100) or gaining nine (with probability
1/100). In the minimax regret community, actions are chosen to minimize
the maximum possible regret. The four possible regret outcomes
are shown in table 36.3. If you buy the ticket and it doesn’t win, you
have a regret of £1, because if you had not bought it you would have
been £1 better off. If you do not buy the ticket and it wins, you have
a regret of £9, because if you had bought it you would have been £9
better off. The action that minimizes the maximum possible regret is
thus to buy the ticket.


Table 36.2. Utility in the lottery ticket problem.

                 Action
    Outcome      Buy    Don’t buy
    No win       −1     0
    Wins         +9     0


Table 36.3. Regret in the lottery ticket problem.

                 Action
    Outcome      Buy    Don’t buy
    No win       1      0
    Wins         0      9
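The construction of the regret table from the utility table, and the two competing decision rules, can be sketched in a few lines of Python. The dictionary layout and function names are illustrative assumptions; the numbers are those of the lottery ticket problem above.

```python
# Utility of each action under each outcome (table 36.2).
UTILITY = {
    "buy": {"no win": -1, "wins": 9},
    "don't buy": {"no win": 0, "wins": 0},
}
# Probabilities of the two outcomes, as stated in the exercise.
PROB = {"no win": 0.99, "wins": 0.01}


def regret_table(utility):
    """Regret = best achievable utility for that outcome minus utility obtained."""
    actions = list(utility)
    outcomes = list(next(iter(utility.values())))
    best = {o: max(utility[a][o] for a in actions) for o in outcomes}
    return {a: {o: best[o] - utility[a][o] for o in outcomes} for a in actions}


def minimax_regret_action(utility):
    """Action whose worst-case regret is smallest."""
    regret = regret_table(utility)
    return min(regret, key=lambda a: max(regret[a].values()))


def expected_utility_action(utility, prob):
    """Action with the largest expected utility."""
    return max(utility, key=lambda a: sum(prob[o] * utility[a][o] for o in prob))


if __name__ == "__main__":
    print(regret_table(UTILITY))
    print("minimax regret:", minimax_regret_action(UTILITY))
    print("expected utility:", expected_utility_action(UTILITY, PROB))
```

The two rules disagree here: the expected utility of buying is 0.99 × (−1) + 0.01 × 9 = −0.9, so an expected-utility maximizer declines, whereas the minimax-regret rule buys.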


Discuss whether this use of regret to choose actions can be philosophi- 
cally justified. 


The above problem can be turned into an investment portfolio decision 
problem by imagining that you have been given one pound to invest in 
two possible funds for one day: Fred’s lottery fund, and the cash fund. If 
you put £$f_1$ into Fred’s lottery fund, Fred promises to return £$9 f_1$ to you
if the lottery ticket is a winner, and otherwise nothing. The remaining
£$f_0$ (with $f_0 = 1 - f_1$) is kept as cash. What is the best investment?
Show that the minimax regret community will invest $f_1 = 9/10$ of their
money in the high risk, high return lottery fund, and only $f_0 = 1/10$ in
cash. Can this investment method be justified?


Exercise 36.8.[3] Gambling oddities (from Cover and Thomas (1991)). A horse
race involving J horses occurs repeatedly, and you are obliged to bet 
all your money each time. Your bet at time t can be represented by 
a normalized probability vector $\mathbf{b}$ multiplied by your money $m(t)$. The
odds offered by the bookies are such that if horse $i$ wins then your return
is $m(t+1) = b_i o_i m(t)$. Assuming the bookies’ odds are ‘fair’, that is,
\[ \sum_i \frac{1}{o_i} = 1 \tag{36.15} \]
and assuming that the probability that horse $i$ wins is $p_i$, work out the
optimal betting strategy if your aim is Cover’s aim, namely, to maximize
the expected value of $\log m(T)$. Show that the optimal strategy sets $\mathbf{b}$
equal to $\mathbf{p}$, independent of the bookies’ odds $\mathbf{o}$. Show that when this
strategy is used, the money is expected to grow exponentially as:
\[ 2^{T W(\mathbf{b}, \mathbf{p})} \tag{36.16} \]
where $W = \sum_i p_i \log b_i o_i$.
If you only bet once, is the optimal strategy any different? 


Do you think this optimal strategy makes sense? Do you think that it’s 
‘optimal’, in common language, to ignore the bookies’ odds? What can 
you conclude about ‘Cover’s aim’? 


Exercise 36.9.[3] Two ordinary dice are thrown repeatedly; the outcome of


each throw is the sum of the two numbers. Joe Shark, who says that 6 
and 8 are his lucky numbers, bets even money that a 6 will be thrown 
before the first 7 is thrown. If you were a gambler, would you take the 
bet? What is your probability of winning? Joe then bets even money 
that an 8 will be thrown before the first 7 is thrown. Would you take 
the bet? 


Having gained your confidence, Joe suggests combining the two bets into 
a single bet: he bets a larger sum, still at even odds, that an 8 and a 
6 will be thrown before two 7s have been thrown. Would you take the 
bet? What is your probability of winning? 


Bayesian Inference and Sampling Theory


There are two schools of statistics. Sampling theorists concentrate on having 
methods guaranteed to work most of the time, given minimal assumptions. 
Bayesians try to make inferences that take into account all available informa- 
tion and answer the question of interest given the particular data set. As you 
have probably gathered, I strongly recommend the use of Bayesian methods. 


Sampling theory is the widely used approach to statistics, and most pa- 
pers in most journals report their experiments using quantities like confidence 
intervals, significance levels, and p-values. A p-value (e.g. p = 0.05) is the prob- 
ability, given a null hypothesis for the probability distribution of the data, that 
the outcome would be as extreme as, or more extreme than, the observed out- 
come. Untrained readers — and perhaps, more worryingly, the authors of many 
papers — usually interpret such a p-value as if it is a Bayesian probability (for 
example, the posterior probability of the null hypothesis), an interpretation 
that both sampling theorists and Bayesians would agree is incorrect. 


In this chapter we study a couple of simple inference problems in order to 
compare these two approaches to statistics. 


While in some cases, the answers from a Bayesian approach and from sam- 
pling theory are very similar, we can also find cases where there are significant 
differences. We have already seen such an example in exercise 3.15 (p.59), 
where a sampling theorist got a p-value smaller than 7%, and viewed this as 
strong evidence against the null hypothesis, whereas the data actually favoured 
the null hypothesis over the simplest alternative. On p.64, another example 
was given where the p-value was smaller than the mystical value of 5%, yet the 
data again favoured the null hypothesis. Thus in some cases, sampling theory 
can be trigger-happy, declaring results to be ‘sufficiently improbable that the 
null hypothesis should be rejected’, when those results actually weakly sup- 
port the null hypothesis. As we will now see, there are also inference problems 
where sampling theory fails to detect ‘significant’ evidence where a Bayesian 
approach and everyday intuition agree that the evidence is strong. Most telling 
of all are the inference problems where the ‘significance’ assigned by sampling 
theory changes depending on irrelevant factors concerned with the design of 
the experiment. 


This chapter is only provided for those readers who are curious about the 
sampling theory / Bayesian methods debate. If you find any of this chapter 
tough to understand, please skip it. There is no point trying to understand 
the debate. Just use Bayesian methods — they are much easier to understand 
than the debate itself! 


> 37.1 A medical example


We are trying to reduce the incidence of an unpleasant disease 
called microsoftus. Two vaccinations, A and B, are tested on 
a group of volunteers. Vaccination B is a control treatment, a 
placebo treatment with no active ingredients. Of the 40 subjects, 
30 are randomly assigned to have treatment A and the other 10 
are given the control treatment B. We observe the subjects for one 
year after their vaccinations. Of the 30 in group A, one contracts 
microsoftus. Of the 10 in group B, three contract microsoftus. 


Is treatment A better than treatment B? 


Sampling theory has a go 


The standard sampling theory approach to the question ‘is A better than B?’ 
is to construct a statistical test. The test usually compares a hypothesis such 
as 


$H_1$: ‘A and B have different effectivenesses’
with a null hypothesis such as
$H_0$: ‘A and B have exactly the same effectivenesses as each other’.


A novice might object ‘no, no, I want to compare the hypothesis “A is better 
than B” with the alternative “B is better than A”!’ but such objections are 
not welcome in sampling theory. 

Once the two hypotheses have been defined, the first hypothesis is scarcely
mentioned again; attention focuses solely on the null hypothesis. It makes me
laugh to write this, but it’s true! The null hypothesis is accepted or rejected
purely on the basis of how unexpected the data were to $H_0$, not on how much
better $H_1$ predicted the data. One chooses a statistic which measures how
much a data set deviates from the null hypothesis. In the example here, the
standard statistic to use would be one called $\chi^2$ (chi-squared). To compute
$\chi^2$, we take the difference between each data measurement and its expected
value assuming the null hypothesis to be true, and divide the square of that
difference by the variance of the measurement, assuming the null hypothesis to
be true. In the present problem, the four data measurements are the integers
$F_{A+}$, $F_{A-}$, $F_{B+}$, and $F_{B-}$, that is, the number of subjects given treatment A
who contracted microsoftus ($F_{A+}$), the number of subjects given treatment A
who didn’t ($F_{A-}$), and so forth. The definition of $\chi^2$ is:


\[ \chi^2 = \sum_i \frac{(F_i - \langle F_i \rangle)^2}{\langle F_i \rangle}. \tag{37.1} \]

Actually, in my elementary statistics book (Spiegel, 1988) I find Yates’s
correction is recommended:
\[ \chi^2 = \sum_i \frac{(|F_i - \langle F_i \rangle| - 0.5)^2}{\langle F_i \rangle}. \tag{37.2} \]

[Margin note: If you want to know about Yates’s correction, read a sampling
theory textbook. The point of this chapter is not to teach sampling theory;
I merely mention Yates’s correction because it is what a professional
sampling theorist might use.]


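In this case, given the null hypothesis that treatments A and B are equally
effective, and have rates $f_+$ and $f_-$ for the two outcomes, the expected counts
are:
\[ \langle F_{A+} \rangle = f_+ N_A \qquad \langle F_{A-} \rangle = f_- N_A \]
\[ \langle F_{B+} \rangle = f_+ N_B \qquad \langle F_{B-} \rangle = f_- N_B. \tag{37.3} \]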
The test accepts or rejects the null hypothesis on the basis of how big $\chi^2$ is.
To make this test precise, and give it a ‘significance level’, we have to work
out what the sampling distribution of $\chi^2$ is, taking into account the fact that
the four data points are not independent (they satisfy the two constraints
$F_{A+} + F_{A-} = N_A$ and $F_{B+} + F_{B-} = N_B$) and the fact that the parameters
$f_\pm$ are not known. These three constraints reduce the number of degrees
of freedom in the data from four to one. [If you want to learn more about
computing the ‘number of degrees of freedom’, read a sampling theory book; in
Bayesian methods we don’t need to know all that, and quantities equivalent to
the number of degrees of freedom pop straight out of a Bayesian analysis when
they are appropriate.] These sampling distributions are tabulated by sampling
theory gnomes and come accompanied by warnings about the conditions under
which they are accurate. For example, standard tabulated distributions for $\chi^2$
are only accurate if the expected numbers $F_i$ are about 5 or more.

[Margin note: The sampling distribution of a statistic is the probability
distribution of its value under repetitions of the experiment, assuming that
the null hypothesis is true.]

Once the data arrive, sampling theorists estimate the unknown parameters
$f_\pm$ of the null hypothesis from the data:
\[ \hat{f}_+ = \frac{F_{A+} + F_{B+}}{N_A + N_B}, \qquad \hat{f}_- = \frac{F_{A-} + F_{B-}}{N_A + N_B}, \tag{37.4} \]








and evaluate $\chi^2$. At this point, the sampling theory school divides itself into
two camps. One camp uses the following protocol: first, before looking at the
data, pick the significance level of the test (e.g. 5%), and determine the critical
value of $\chi^2$ above which the null hypothesis will be rejected. (The significance
level is the fraction of times that the statistic $\chi^2$ would exceed the critical
value, if the null hypothesis were true.) Then evaluate $\chi^2$, compare with the
critical value, and declare the outcome of the test, and its significance level
(which was fixed beforehand).

The second camp looks at the data, finds $\chi^2$, then looks in the table of
$\chi^2$-distributions for the significance level, $p$, for which the observed value of $\chi^2$
would be the critical value. The result of the test is then reported by giving
this value of $p$, which is the fraction of times that a result as extreme as the one
observed, or more extreme, would be expected to arise if the null hypothesis
were true.

Let’s apply these two methods. First camp: let’s pick 5% as our significance
level. The critical value for $\chi^2$ with one degree of freedom is $\chi^2_{0.05} = 3.84$.
The estimated values of $f_\pm$ are
\[ \hat{f}_+ = 1/10, \qquad \hat{f}_- = 9/10. \tag{37.5} \]
The expected values of the four measurements are
\[ \langle F_{A+} \rangle = 3 \tag{37.6} \]
\[ \langle F_{A-} \rangle = 27 \tag{37.7} \]
\[ \langle F_{B+} \rangle = 1 \tag{37.8} \]
\[ \langle F_{B-} \rangle = 9 \tag{37.9} \]
and $\chi^2$ (as defined in equation (37.1)) is
\[ \chi^2 = 5.93. \tag{37.10} \]
Since this value exceeds 3.84, we reject the null hypothesis that the two treatments
are equivalent at the 0.05 significance level. However, if we use Yates’s
correction, we find $\chi^2 = 3.33$, and therefore accept the null hypothesis.
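The arithmetic behind (37.5)-(37.10) is short enough to check by machine. The following sketch recomputes the pooled rate estimates, the expected counts, and both versions of the statistic; the function and variable names are illustrative.

```python
def chi_squared(observed, expected, yates=False):
    """Chi-squared statistic of (37.1), or with Yates's correction (37.2)."""
    shift = 0.5 if yates else 0.0
    return sum((abs(o - e) - shift) ** 2 / e for o, e in zip(observed, expected))


# Counts in the order F_A+, F_A-, F_B+, F_B- from the medical example.
N_A, N_B = 30, 10
OBSERVED = [1, 29, 3, 7]

# Pooled estimates of the rates under the null hypothesis (37.4).
f_plus = (OBSERVED[0] + OBSERVED[2]) / (N_A + N_B)
f_minus = (OBSERVED[1] + OBSERVED[3]) / (N_A + N_B)

# Expected counts under the null hypothesis (37.3).
EXPECTED = [f_plus * N_A, f_minus * N_A, f_plus * N_B, f_minus * N_B]

if __name__ == "__main__":
    print("expected counts:", EXPECTED)
    print("chi-squared:", round(chi_squared(OBSERVED, EXPECTED), 2))
    print("with Yates's correction:", round(chi_squared(OBSERVED, EXPECTED, yates=True), 2))
```

The uncorrected statistic lies above the 5% critical value of 3.84 and the corrected one below it, which is why the two versions of the test disagree.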
Camp two runs a finger across the $\chi^2$ table found at the back of any good
sampling theory book and finds $\chi^2_{0.10} = 2.71$. Interpolating between $\chi^2_{0.10}$ and
$\chi^2_{0.05}$, camp two reports ‘the p-value is p = 0.07’.

Notice that this answer does not say how much more effective A is than B, 
it simply says that A is ‘significantly’ different from B. And here, ‘significant’ 
means only ‘statistically significant’, not practically significant. 

The man in the street, reading the statement that ‘the treatment was sig- 
nificantly different from the control (p = 0.07)’, might come to the conclusion 
that ‘there is a 93% chance that the treatments differ in effectiveness’. But 
what ‘p = 0.07’ actually means is ‘if you did this experiment many times, and 
the two treatments had equal effectiveness, then 7% of the time you would 
find a value of $\chi^2$ more extreme than the one that happened here’. This has
almost nothing to do with what we want to know, which is how likely it is 
that treatment A is better than B. 


Let me through, I’m a Bayesian 


OK, now let’s infer what we really want to know. We scrap the hypothesis 
that the two treatments have exactly equal effectivenesses, since we do not 
believe it. There are two unknown parameters, $p_{A+}$ and $p_{B+}$, which are the
probabilities that people given treatments A and B, respectively, contract the 
disease. 

Given the data, we can infer these two probabilities, and we can answer 
questions of interest by examining the posterior distribution. 

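The posterior distribution is
\[ P(p_{A+}, p_{B+} \mid \{F_i\}) = \frac{P(\{F_i\} \mid p_{A+}, p_{B+}) \, P(p_{A+}, p_{B+})}{P(\{F_i\})}. \tag{37.11} \]
The likelihood function is
\[ P(\{F_i\} \mid p_{A+}, p_{B+}) = \binom{N_A}{F_{A+}} p_{A+}^{F_{A+}} p_{A-}^{F_{A-}} \binom{N_B}{F_{B+}} p_{B+}^{F_{B+}} p_{B-}^{F_{B-}} \tag{37.12} \]
\[ = \binom{30}{1} p_{A+}^{1} p_{A-}^{29} \binom{10}{3} p_{B+}^{3} p_{B-}^{7}. \tag{37.13} \]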


What prior distribution should we use? The prior distribution gives us the
opportunity to include knowledge from other experiments, or a prior belief
that the two parameters $p_{A+}$ and $p_{B+}$, while different from each other, are
expected to have similar values.

Here we will use the simplest vanilla prior distribution, a uniform distribution
over each parameter.
\[ P(p_{A+}, p_{B+}) = 1. \tag{37.14} \]
We can now plot the posterior distribution. Given the assumption of a separable
prior on $p_{A+}$ and $p_{B+}$, the posterior distribution is also separable:
\[ P(p_{A+}, p_{B+} \mid \{F_i\}) = P(p_{A+} \mid F_{A+}, F_{A-}) \, P(p_{B+} \mid F_{B+}, F_{B-}). \tag{37.15} \]





The two posterior distributions are shown in figure 37.1 (except the graphs
are not normalized) and the joint posterior probability is shown in figure 37.2.

If we want to know the answer to the question ‘how probable is it that $p_{A+}$
is smaller than $p_{B+}$?’, we can answer exactly that question by computing the
posterior probability
\[ P(p_{A+} < p_{B+} \mid \text{Data}), \tag{37.16} \]




which is the integral of the joint posterior probability $P(p_{A+}, p_{B+} \mid \text{Data})$
shown in figure 37.2 over the region in which $p_{A+} < p_{B+}$, i.e., the shaded
triangle in figure 37.3. The value of this integral (obtained by a straightforward
numerical integration of the likelihood function (37.13) over the relevant
region) is $P(p_{A+} < p_{B+} \mid \text{Data}) = 0.990$.

Thus there is a 99% chance, given the data and our prior assumptions, 
that treatment A is superior to treatment B. In conclusion, according to our 
Bayesian model, the data (1 out of 30 contracted the disease after vaccination 
A, and 3 out of 10 contracted the disease after vaccination B) give very strong 
evidence — about 99 to one — that treatment A is superior to treatment B. 
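A numerical integration of this kind can be sketched as follows. The midpoint grid, its resolution, and the function name are assumptions of this sketch rather than a description of how the figure quoted above was obtained. With the uniform prior (37.14), the binomial coefficients in (37.13) cancel in the ratio and are omitted.

```python
def prob_a_below_b(f_a_pos=1, n_a=30, f_b_pos=3, n_b=10, grid=400):
    """Posterior probability that p_A+ < p_B+ under uniform priors.

    Integrates the likelihood (37.13) on a midpoint grid over the unit
    square and returns the fraction of the mass in the region p_A+ < p_B+.
    """
    h = 1.0 / grid
    points = [(i + 0.5) * h for i in range(grid)]
    like_a = [p**f_a_pos * (1.0 - p) ** (n_a - f_a_pos) for p in points]
    like_b = [p**f_b_pos * (1.0 - p) ** (n_b - f_b_pos) for p in points]

    mass_in_region = 0.0
    cumulative_a = 0.0  # likelihood mass of p_A+ in cells strictly below cell j
    for j in range(grid):
        # Cells on the diagonal straddle the boundary, so count half of them.
        mass_in_region += like_b[j] * (cumulative_a + 0.5 * like_a[j])
        cumulative_a += like_a[j]

    return mass_in_region / (sum(like_a) * sum(like_b))


if __name__ == "__main__":
    print(round(prob_a_below_b(), 3))
```

On this grid the sketch returns approximately 0.987, which agrees with the 99% figure in the text to two significant figures; the third decimal place depends on the details of the integration.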

In the Bayesian approach, it is also easy to answer other relevant questions. 
For example, if we want to know ‘how likely is it that treatment A is ten times 
more effective than treatment B?’, we can integrate the joint posterior probability
$P(p_{A+}, p_{B+} \mid \text{Data})$ over the region in which $p_{A+} < 10\,p_{B+}$ (figure 37.4).


Model comparison 


If there were a situation in which we really did want to compare the two
hypotheses $H_0$: $p_{A+} = p_{B+}$ and $H_1$: $p_{A+} \neq p_{B+}$, we can of course do this
directly with Bayesian methods also.

As an example, consider the data set:


D: One subject, given treatment A, subsequently contracted microsoftus.
One subject, given treatment B, did not.

    Treatment        A    B
    Got disease      1    0
    Did not          0    1
    Total treated    1    1
Figure 37.1. Posterior probabilities of the two effectivenesses.
Treatment A: solid line; B: dotted line.


Figure 37.2. Joint posterior probability of the two effectivenesses
(contour plot and surface plot).


Figure 37.3. The proposition $p_{A+} < p_{B+}$ is true for all points in
the shaded triangle. To find the probability of this proposition we
integrate the joint posterior probability $P(p_{A+}, p_{B+} \mid \text{Data})$
(figure 37.2) over this region.


Figure 37.4. The proposition $p_{A+} < 10\,p_{B+}$ is true for all points
in the shaded triangle.
How strongly does this data set favour $H_1$ over $H_0$?

We answer this question by computing the evidence for each hypothesis.
Let’s assume uniform priors over the unknown parameters of the models. The
first hypothesis $H_0$: $p_{A+} = p_{B+}$ has just one unknown parameter, let’s call it
$p$.
\[ P(p \mid H_0) = 1 \qquad p \in (0, 1). \tag{37.17} \]
We’ll use the uniform prior over the two parameters of model $H_1$ that we used
before:
\[ P(p_{A+}, p_{B+} \mid H_1) = 1 \qquad p_{A+} \in (0, 1), \; p_{B+} \in (0, 1). \tag{37.18} \]
Now, the probability of the data $D$ under model $H_0$ is the normalizing constant
from the inference of $p$ given $D$:
\[ P(D \mid H_0) = \int dp \, P(D \mid p) \, P(p \mid H_0) \tag{37.19} \]
\[ = \int dp \, p(1 - p) \times 1 \tag{37.20} \]
\[ = 1/6. \tag{37.21} \]
The probability of the data $D$ under model $H_1$ is given by a simple two-dimensional
integral:
\[ P(D \mid H_1) = \int dp_{A+} \int dp_{B+} \, P(D \mid p_{A+}, p_{B+}) \, P(p_{A+}, p_{B+} \mid H_1) \tag{37.22} \]
\[ = \int dp_{A+} \, p_{A+} \int dp_{B+} \, (1 - p_{B+}) \tag{37.23} \]
\[ = 1/2 \times 1/2 \tag{37.24} \]
\[ = 1/4. \tag{37.25} \]
Thus the evidence ratio in favour of model $H_1$, which asserts that the two
effectivenesses are unequal, is
\[ \frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{1/4}{1/6} = \frac{0.6}{0.4}. \tag{37.26} \]
So if the prior probability over the two hypotheses was 50:50, the posterior
probability is 60:40 in favour of $H_1$.
Is it not easy to get sensible answers to well-posed questions using Bayesian 
methods? 
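The two evidences are integrals of monomials in $p$ and $(1-p)$, so they can be evaluated exactly. The sketch below does so with rational arithmetic, using the standard identity for the integral of $p^a (1-p)^b$ over the unit interval; the helper name is illustrative.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Exact integral of p**a * (1 - p)**b over (0, 1), for integers a, b >= 0."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


# H0: one shared parameter p; the data contribute p * (1 - p).
evidence_h0 = beta_integral(1, 1)

# H1: separate parameters; A contributes p_A+, B contributes (1 - p_B+).
evidence_h1 = beta_integral(1, 0) * beta_integral(0, 1)

evidence_ratio = evidence_h1 / evidence_h0

# Posterior probability of H1 when the prior over hypotheses is 50:50.
posterior_h1 = evidence_h1 / (evidence_h0 + evidence_h1)

if __name__ == "__main__":
    print(evidence_h0, evidence_h1, evidence_ratio, posterior_h1)
```

The ratio 3/2 corresponds to posterior odds of 60:40, in agreement with (37.26).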


[The sampling theory answer to this question would involve the identical 
significance test that was used in the preceding problem; that test would yield 
a ‘not significant’ result. I think it is greatly preferable to acknowledge what 
is obvious to the intuition, namely that the data D do give weak evidence in 
favour of $H_1$. Bayesian methods quantify how weak the evidence is.]


> 37.2 Dependence of p-values on irrelevant information 


In an expensive laboratory, Dr. Bloggs tosses a coin labelled a and b twelve 
times and the outcome is the string 


aaabaaaabaab, 


which contains three bs and nine as. 
What evidence do these data give that the coin is biased in favour of a? 
Dr. Bloggs consults his sampling theory friend who says ‘let $r$ be the number
of bs and $n = 12$ be the total number of tosses; I view $r$ as the random
variable and find the probability of $r$ taking on the value $r = 3$ or a more
extreme value, assuming the null hypothesis $p_a = 0.5$ to be true’. He thus
computes
\[ P(r \le 3 \mid n = 12, H_0) = \sum_{r=0}^{3} \binom{12}{r} \left(\tfrac{1}{2}\right)^{12} = \left[ \binom{12}{0} + \binom{12}{1} + \binom{12}{2} + \binom{12}{3} \right] \left(\tfrac{1}{2}\right)^{12} = 0.07, \tag{37.27} \]
and reports ‘at the significance level of 5%, there is not significant evidence
of bias in favour of a’. Or, if the friend prefers to report p-values rather than 
simply compare p with 5%, he would report ‘the p-value is 7%, which is not 
conventionally viewed as significantly small’. If a two-tailed test seemed more 
appropriate, he might compute the two-tailed area, which is twice the above 
probability, and report ‘the p-value is 15%, which is not significantly small’. 
We won’t focus on the issue of the choice between the one-tailed and two-tailed 
tests, as we have bigger fish to catch. 

Dr. Bloggs pays careful attention to the calculation (37.27), and responds 
‘no, no, the random variable in the experiment was not r: I decided before 
running the experiment that I would keep tossing the coin until I saw three 
bs; the random variable is thus n’. 


Such experimental designs are not unusual. In my experiments on error- 
correcting codes I often simulate the decoding of a code until a chosen number 
$r$ of block errors (bs) has occurred, since the error on the inferred value of $\log p_b$
goes roughly as $1/\sqrt{r}$, independent of $n$.


Exercise 37.1.[2] Find the Bayesian inference about the bias $p_a$ of the coin
given the data, and determine whether a Bayesian’s inferences depend
on what stopping rule was in force.


According to sampling theory, a different calculation is required in order
to assess the ‘significance’ of the result $n = 12$. The probability distribution
of $n$ given $H_0$ is the probability that the first $n-1$ tosses contain exactly $r-1$
bs and then the $n$th toss is a b.
\[ P(n \mid H_0, r) = \binom{n-1}{r-1} \frac{1}{2^n}. \tag{37.28} \]
The sampling theorist thus computes
\[ P(n \ge 12 \mid r = 3, H_0) = 0.03. \tag{37.29} \]


He reports back to Dr. Bloggs, ‘the p-value is 3% — there is significant evidence 
of bias after all!’ 
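Both p-values can be reproduced from the same twelve tosses. The sketch below evaluates (37.27) for the fixed-$n$ design and sums (37.28) for the fixed-$r$ design; the function names are illustrative.

```python
from math import comb


def p_value_fixed_n(r_obs=3, n=12):
    """P(r <= r_obs | n, H0): the number of tosses n was fixed in advance."""
    return sum(comb(n, r) for r in range(r_obs + 1)) / 2**n


def p_value_fixed_r(n_obs=12, r=3):
    """P(n >= n_obs | r, H0): tossing continued until the r-th b appeared."""
    # Complement of stopping before toss n_obs, using distribution (37.28).
    p_stop_earlier = sum(comb(n - 1, r - 1) / 2**n for n in range(r, n_obs))
    return 1.0 - p_stop_earlier


if __name__ == "__main__":
    print("fixed n:", round(p_value_fixed_n(), 3))
    print("fixed r:", round(p_value_fixed_r(), 3))
```

The data are identical in the two calls; only the assumed stopping rule differs, and that alone moves the p-value from one side of 5% to the other.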

What do you think Dr. Bloggs should do? Should he publish the result, 
with this marvellous p-value, in one of the journals that insists that all exper- 
imental results have their ‘significance’ assessed using sampling theory? Or 
should he boot the sampling theorist out of the door and seek a coherent 
method of assessing significance, one that does not depend on the stopping 
rule? 

At this point the audience divides in two. Half the audience intuitively 
feel that the stopping rule is irrelevant, and don’t need any convincing that 
the answer to exercise 37.1 (p.463) is ‘the inferences about $p_a$ do not depend
on the stopping rule’. The other half, perhaps on account of a thorough
training in sampling theory, intuitively feel that Dr. Bloggs’s stopping rule,
which stopped tossing the moment the third b appeared, may have biased the
experiment somehow. If you are in the second group, I encourage you to reflect
on the situation, and hope you’ll eventually come round to the view that is
consistent with the likelihood principle, which is that the stopping rule is not
relevant to what we have learned about $p_a$.

As a thought experiment, consider some onlookers who (in order to save 
money) are spying on Dr. Bloggs’s experiments: each time he tosses the coin, 
the spies update the values of r and n. The spies are eager to make inferences 
from the data as soon as each new result occurs. Should the spies’ beliefs 
about the bias of the coin depend on Dr. Bloggs’s intentions regarding the 
continuation of the experiment? 

The fact that the p-values of sampling theory do depend on the stopping 
rule (indeed, whole volumes of the sampling theory literature are concerned 
with the task of assessing ‘significance’ when a complicated stopping rule is 
required — ‘sequential probability ratio tests’, for example) seems to me a com- 
pelling argument for having nothing to do with p-values at all. A Bayesian 
solution to this inference problem was given in sections 3.2 and 3.3 and exer- 
cise 3.15 (p.59). 

Would it help clarify this issue if I added one more scene to the story? 
The janitor, who’s been eavesdropping on Dr. Bloggs’s conversation, comes in 
and says ‘I happened to notice that just after you stopped doing the experi- 
ments on the coin, the Officer for Whimsical Departmental Rules ordered the 
immediate destruction of all such coins. Your coin was therefore destroyed by 
the departmental safety officer. There is no way you could have continued the 
experiment much beyond n = 12 tosses. Seems to me, you need to recompute 
your p-value?’ 


> 37.3 Confidence intervals 


In an experiment in which data $D$ are obtained from a system with an unknown
parameter $\theta$, a standard concept in sampling theory is the idea of a confidence
interval for $\theta$. Such an interval $(\theta_{\min}(D), \theta_{\max}(D))$ has associated with it a
confidence level such as 95% which is informally interpreted as ‘the probability
that $\theta$ lies in the confidence interval’.

Let’s make precise what the confidence level really means, then give an
example. A confidence interval is a function $(\theta_{\min}(D), \theta_{\max}(D))$ of the data
set $D$. The confidence level of the confidence interval is a property that we can
compute before the data arrive. We imagine generating many data sets from a
particular true value of $\theta$, and calculating the interval $(\theta_{\min}(D), \theta_{\max}(D))$, and
then checking whether the true value of $\theta$ lies in that interval. If, averaging
over all these imagined repetitions of the experiment, the true value of $\theta$ lies
in the confidence interval a fraction $f$ of the time, and this property holds for
all true values of $\theta$, then the confidence level of the confidence interval is $f$.

For example, if $\theta$ is the mean of a Gaussian distribution which is known
to have standard deviation 1, and $D$ is a sample from that Gaussian, then
$(\theta_{\min}(D), \theta_{\max}(D)) = (D-2, D+2)$ is a 95% confidence interval for $\theta$.

Let us now look at a simple example where the meaning of the confidence
level becomes clearer. Let the parameter $\theta$ be an integer, and let the data be
a pair of points $x_1, x_2$, drawn independently from the following distribution:

$$
P(x \mid \theta) = \begin{cases}
1/2 & x = \theta \\
1/2 & x = \theta + 1 \\
0 & \text{for other values of } x.
\end{cases} \tag{37.30}
$$

For example, if $\theta$ were 39, then we could expect the following data sets:


$$
\begin{aligned}
D = (x_1, x_2) &= (39, 39) \text{ with probability } 1/4; \\
(x_1, x_2) &= (39, 40) \text{ with probability } 1/4; \\
(x_1, x_2) &= (40, 39) \text{ with probability } 1/4; \\
(x_1, x_2) &= (40, 40) \text{ with probability } 1/4.
\end{aligned} \tag{37.31}
$$

We now consider the following confidence interval:

$$
[\theta_{\min}(D), \theta_{\max}(D)] = [\min(x_1, x_2), \min(x_1, x_2)]. \tag{37.32}
$$


For example, if $(x_1, x_2) = (40, 39)$, then the confidence interval for $\theta$ would be
$[\theta_{\min}(D), \theta_{\max}(D)] = [39, 39]$.

Let’s think about this confidence interval. What is its confidence level? 
By considering the four possibilities shown in (37.31), we can see that there 
is a 75% chance that the confidence interval will contain the true value. The 
confidence interval therefore has a confidence level of 75%, by definition. 

Now, what if the data we acquire are $(x_1, x_2) = (29, 29)$? Well, we can
compute the confidence interval, and it is [29, 29]. So shall we report this
interval, and its associated confidence level, 75%? This would be correct
by the rules of sampling theory. But does this make sense? What do we
actually know in this case? Intuitively, or by Bayes’ theorem, it is clear that $\theta$
could either be 29 or 28, and both possibilities are equally likely (if the prior
probabilities of 28 and 29 were equal). The posterior probability of $\theta$ is 50%
on 29 and 50% on 28.

What if the data are $(x_1, x_2) = (29, 30)$? In this case, the confidence
interval is still [29, 29], and its associated confidence level is 75%. But in this
case, by Bayes’ theorem, or common sense, we are 100% sure that $\theta$ is 29.

In neither case is the probability that $\theta$ lies in the ‘75% confidence interval’
equal to 75%!
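The numbers in this example are small enough to enumerate exactly. The sketch below is in Python, with illustrative function names that are not part of the text. It computes the confidence level of the interval of equation (37.32) by summing over the four possible data sets, and the posterior probability of θ for a given data set, assuming equal prior probabilities for the candidate values of θ.

```python
from fractions import Fraction
from itertools import product


def likelihood(x, theta):
    """P(x | theta): probability 1/2 on x = theta and 1/2 on x = theta + 1."""
    return Fraction(1, 2) if x in (theta, theta + 1) else Fraction(0)


def interval(x1, x2):
    """The interval of equation (37.32): both ends equal min(x1, x2)."""
    return (min(x1, x2), min(x1, x2))


def confidence_level(theta):
    """Probability, over data sets generated from theta, that the interval contains theta."""
    level = Fraction(0)
    for x1, x2 in product((theta, theta + 1), repeat=2):
        lo, hi = interval(x1, x2)
        if lo <= theta <= hi:
            level += likelihood(x1, theta) * likelihood(x2, theta)
    return level


def posterior(x1, x2):
    """Posterior over theta given (x1, x2), assuming a flat prior on the candidates."""
    candidates = range(min(x1, x2) - 1, max(x1, x2) + 1)
    weights = {th: likelihood(x1, th) * likelihood(x2, th) for th in candidates}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("data are impossible under every value of theta")
    return {th: w / total for th, w in weights.items() if w > 0}


print(confidence_level(39))  # 3/4
print(posterior(29, 29))     # probability 1/2 on each of 28 and 29
print(posterior(29, 30))     # probability 1 on 29
```

The confidence level is a single number attached to the procedure, whereas the posterior changes with the data actually observed.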

Thus 


1. the way in which many people interpret the confidence levels of sampling 
theory is incorrect; 


2. given some data, what people usually want to know (whether they know 
it or not) is a Bayesian posterior probability distribution. 


Are all these examples contrived? Am I making a fuss about nothing? 
If you are sceptical about the dogmatic views I have expressed, I encourage 
you to look at a case study: look in depth at exercise 35.4 (p.446) and the 
reference (Kepler and Oprea, 2001), in which sampling theory estimates and 
confidence intervals for a mutation rate are constructed. Try both methods 
on simulated data — the Bayesian approach based on simply computing the 
likelihood function, and the confidence interval from sampling theory; and let 
me know if you don’t find that the Bayesian answer is always better than the 
sampling theory answer; and often much, much better. This suboptimality 
of sampling theory, achieved with great effort, is why I am passionate about 
Bayesian methods. Bayesian methods are straightforward, and they optimally 
use all the information in the data. 


> 37.4 Some compromise positions 


Let’s end on a conciliatory note. Many sampling theorists are pragmatic — 
they are happy to choose from a selection of statistical methods, choosing 
whichever has the ‘best’ long-run properties. In contrast, I have no problem 
with the idea that there is only one answer to a well-posed problem; but it’s
not essential to convert sampling theorists to this viewpoint: instead, we can 
offer them Bayesian estimators and Bayesian confidence intervals, and request 
that the sampling theoretical properties of these methods be evaluated. We 
don’t need to mention that the methods are derived from a Bayesian per- 
spective. If the sampling properties are good then the pragmatic sampling 
theorist will choose to use the Bayesian methods. It is indeed the case that 
many Bayesian methods have good sampling-theoretical properties. Perhaps 
it’s not surprising that a method that gives the optimal answer for each indi- 
vidual case should also be good in the long run! 

Another piece of common ground can be conceded: while I believe that 
most well-posed inference problems have a unique correct answer, which can 
be found by Bayesian methods, not all problems are well-posed. A common 
question arising in data modelling is ‘am I using an appropriate model?’ Model 
criticism, that is, hunting for defects in a current model, is a task that may 
be aided by sampling theory tests, in which the null hypothesis (‘the current 
model is correct’) is well defined, but the alternative model is not specified. 
One could use sampling theory measures such as p-values to guide one’s search 
for the aspects of the model most in need of scrutiny. 


Further reading 


My favourite reading on this topic includes (Jaynes, 1983; Gull, 1988; Loredo, 
1990; Berger, 1985; Jaynes, 2003). Treatises on Bayesian statistics from the 
statistics community include (Box and Tiao, 1973; O’Hagan, 1994). 


> 37.5 Further exercises 


> Exercise 37.2.1] A traffic survey records traffic on two successive days. On 
Friday morning, there are 12 vehicles in one hour. On Saturday morning,
there are 9 vehicles in half an hour. Assuming that the vehicles are
Poisson distributed with rates $\lambda_F$ and $\lambda_S$ (in vehicles per hour)
respectively,

(a) is $\lambda_S$ greater than $\lambda_F$?

(b) by what factor is $\lambda_S$ bigger or smaller than $\lambda_F$?

> Exercise 37.3.20] Write a program to compare treatments A and B given
data $F_{A+}$, $F_{A-}$, $F_{B+}$, $F_{B-}$ as described in section 37.1. The outputs
of the program should be (a) the probability that treatment A is more
effective than treatment B; (b) the probability that $p_{A+} < 10\, p_{B+}$; (c)
the probability that $p_{B+} < 10\, p_{A+}$.
Neural networks
38


Introduction to Neural Networks 


In the field of neural networks, we study the properties of networks of idealized 
‘neurons’. 
Three motivations underlie work in this broad and interdisciplinary field. 


Biology. The task of understanding how the brain works is one of the out- 
standing unsolved problems in science. Some neural network models are 
intended to shed light on the way in which computation and memory 
are performed by brains. 


Engineering. Many researchers would like to create machines that can ‘learn’, 
perform ‘pattern recognition’ or ‘discover patterns in data’. 


Complex systems. A third motivation for being interested in neural net- 
works is that they are complex adaptive systems whose properties are 
interesting in their own right. 


I should emphasize several points at the outset. 


• This book gives only a taste of this field. There are many interesting
neural network models which we will not have time to touch on.

• The models that we discuss are not intended to be faithful models of
biological systems. If they are at all relevant to biology, their relevance
is on an abstract level.

• I will describe some neural network methods that are widely used in
nonlinear data modelling, but I will not be able to give a full description
of the state of the art. If you wish to solve real problems with neural
networks, please read the relevant papers.


> 38.1 Memories 


In the next few chapters we will meet several neural network models which 
come with simple learning algorithms which make them function as memories. 
Perhaps we should dwell for a moment on the conventional idea of memory 
in digital computation. A memory (a string of 5000 bits describing the name 
of a person and an image of their face, say) is stored in a digital computer 
at an address. To retrieve the memory you need to know the address. The 
address has nothing to do with the memory itself. Notice the properties that 
this scheme does not have: 


1. Address-based memory is not associative. Imagine you know half of a
memory, say someone’s face, and you would like to recall the rest of the
memory: their name. If your memory is address-based then you can’t
get at a memory without knowing the address. [Computer scientists have
devoted effort to wrapping traditional address-based memories inside
cunning software to produce content-addressable memories, but content-
addressability does not come naturally. It has to be added on.]

Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


2. Address-based memory is not robust or fault-tolerant. If a one-bit mis- 
take is made in specifying the address then a completely different mem- 
ory will be retrieved. If one bit of a memory is flipped then whenever 
that memory is retrieved the error will be present. Of course, in all mod- 
ern computers, error-correcting codes are used in the memory, so that 
small numbers of errors can be detected and corrected. But this error- 
tolerance is not an intrinsic property of the memory system. If minor 
damage occurs to certain hardware that implements memory retrieval, 
it is likely that all functionality will be catastrophically lost. 


3. Address-based memory is not distributed. In a serial computer that 
is accessing a particular memory, only a tiny fraction of the devices 
participate in the memory recall: the CPU and the circuits that are 
storing the required byte. All the other millions of devices in the machine 
are sitting idle. 


Are there models of truly parallel computation, in which multiple de- 
vices participate in all computations? [Present-day parallel computers 
scarcely differ from serial computers from this point of view. Memory 
retrieval works in just the same way, and control of the computation 
process resides in CPUs. There are simply a few more CPUs. Most of 
the devices sit idle most of the time.] 


Biological memory systems are completely different. 


1. Biological memory is associative. Memory recall is content-addressable.
Given a person’s name, we can often recall their face; and vice versa.
Memories are apparently recalled spontaneously, not just at the request
of some CPU.

2. Biological memory recall is error-tolerant and robust.


• Errors in the cues for memory recall can be corrected. An example
asks you to recall ‘An American politician who was very intelligent
and whose politician father did not like broccoli’. Many people
think of President Bush, even though one of the cues contains an
error.

• Hardware faults can also be tolerated. Our brains are noisy lumps
of meat that are in a continual state of change, with cells being
damaged by natural processes, alcohol, and boxing. While the cells
in our brains and the proteins in our cells are continually changing,
many of our memories persist unaffected.


3. Biological memory is parallel and distributed — not completely distributed 
throughout the whole brain: there does appear to be some functional 
specialization — but in the parts of the brain where memories are stored, 
it seems that many neurons participate in the storage of multiple mem- 
ories. 


These properties of biological memory systems motivate the study of ‘arti- 
ficial neural networks’ — parallel distributed computational systems consisting 
of many interacting simple elements. The hope is that these model systems
might give some hints as to how neural computation is achieved in real bio- 
logical neural networks. 


> 38.2 Terminology


Each time we describe a neural network algorithm we will typically specify 
three things. [If any of this terminology is hard to understand, it’s probably 
best to dive straight into the next chapter.] 


Architecture. The architecture specifies what variables are involved in the 
network and their topological relationships — for example, the variables 
involved in a neural net might be the weights of the connections between 
the neurons, along with the activities of the neurons. 


Activity rule. Most neural network models have short time-scale dynamics: 
local rules define how the activities of the neurons change in response 
to each other. Typically the activity rule depends on the weights (the 
parameters) in the network. 


Learning rule. The learning rule specifies the way in which the neural net- 
work’s weights change with time. This learning is usually viewed as 
taking place on a longer time scale than the time scale of the dynamics 
under the activity rule. Usually the learning rule will depend on the 
activities of the neurons. It may also depend on target
values supplied by a teacher and on the current values of the weights.


Where do these rules come from? Often, activity rules and learning rules are 
invented by imaginative researchers. Alternatively, activity rules and learning 
rules may be derived from carefully chosen objective functions. 

Neural network algorithms can be roughly divided into two classes. 


Supervised neural networks are given data in the form of inputs and tar- 
gets, the targets being a teacher’s specification of what the neural net- 
work’s response to the input should be. 


Unsupervised neural networks are given data in an undivided form — sim- 
ply a set of examples {x}. Some learning algorithms are intended simply 
to memorize these data in such a way that the examples can be recalled 
in the future. Other algorithms are intended to ‘generalize’, to discover 
‘patterns’ in the data, or extract the underlying ‘features’ from them. 


Some unsupervised algorithms are able to make predictions — for exam- 
ple, some algorithms can ‘fill in’ missing variables in an example x — and 
so can also be viewed as supervised networks. 
The Single Neuron as a Classifier


> 39.1 The single neuron


We will study a single neuron for two reasons. First, many neural network
models are built out of single neurons, so it is good to understand them in
detail. And second, a single neuron is itself capable of ‘learning’ (indeed,
various standard statistical methods can be viewed in terms of single neurons),
so this model will serve as a first example of a supervised neural network.

Figure 39.1. A single neuron.


Definition of a single neuron


We will start by defining the architecture and the activity rule of a single
neuron, and we will then derive a learning rule.


Architecture. A single neuron has a number $I$ of inputs $x_i$ and one output
which we will here call $y$. (See figure 39.1.) Associated with each input
is a weight $w_i$ ($i = 1, \ldots, I$). There may be an additional parameter
$w_0$ of the neuron called a bias which we may view as being the weight
associated with an input $x_0$ that is permanently set to 1. The single
neuron is a feedforward device: the connections are directed from the
inputs to the output of the neuron.


Activity rule. The activity rule has two steps.


1. First, in response to the imposed inputs $\mathbf{x}$, we compute the
activation of the neuron,

$$
a = \sum_i w_i x_i, \tag{39.1}
$$

where the sum is over $i = 0, \ldots, I$ if there is a bias and $i = 1, \ldots, I$
otherwise.


2. Second, the output $y$ is set as a function $f(a)$ of the activation.
The output is also called the activity of the neuron, not to be
confused with the activation $a$. There are several possible activation
functions; here are the most popular.


activation $a$ → activity $y(a)$


(a) Deterministic activation functions:

i. Linear.

$$
y(a) = a. \tag{39.2}
$$

ii. Sigmoid (logistic function).

$$
y(a) = \frac{1}{1 + e^{-a}} \qquad (y \in (0, 1)). \tag{39.3}
$$

iii. Sigmoid (tanh).

$$
y(a) = \tanh(a) \qquad (y \in (-1, 1)). \tag{39.4}
$$

iv. Threshold function.

$$
y(a) = \Theta(a) \equiv \begin{cases} 1 & a > 0 \\ -1 & a \le 0. \end{cases} \tag{39.5}
$$

(b) Stochastic activation functions: $y$ is stochastically selected from $\pm 1$.

i. Heat bath.

$$
y(a) = \begin{cases} 1 & \text{with probability } \dfrac{1}{1 + e^{-2a}} \\ -1 & \text{otherwise.} \end{cases} \tag{39.6}
$$

ii. The Metropolis rule produces the output in a way that
depends on the previous output state $y$:

Compute $\Delta = ay$
If $\Delta < 0$, flip $y$ to the other state
Else flip $y$ to the other state with probability $e^{-\Delta}$.
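The activation functions listed above can be written out directly. The following Python sketch is illustrative rather than taken from the text: the function names are invented here, and it assumes the ±1 output convention for the threshold and stochastic units, with the heat-bath probability taken to be 1/(1 + exp(-2a)). The stochastic rules take an explicit random number generator so that runs can be reproduced.

```python
import math
import random


def linear(a):
    return a


def logistic(a):
    """Sigmoid with output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))


def tanh_sigmoid(a):
    """Sigmoid with output in (-1, 1)."""
    return math.tanh(a)


def threshold(a):
    """Deterministic +1/-1 output (assumed convention: -1 when a <= 0)."""
    return 1 if a > 0 else -1


def heat_bath(a, rng):
    """Return +1 with probability 1 / (1 + exp(-2a)), otherwise -1 (assumed form)."""
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * a))
    return 1 if rng.random() < p_plus else -1


def metropolis(a, y_prev, rng):
    """Metropolis rule: the new state depends on the previous state y_prev."""
    delta = a * y_prev
    if delta < 0 or rng.random() < math.exp(-delta):
        return -y_prev  # flip to the other state
    return y_prev


rng = random.Random(0)
for a in (-2.0, 0.0, 2.0):
    print(a, linear(a), round(logistic(a), 3), round(tanh_sigmoid(a), 3),
          threshold(a), heat_bath(a, rng))
```

For large negative activations `math.exp(-a)` can overflow; a production version would guard against that, which this sketch does not.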


> 39.2 Basic neural network concepts


A neural network implements a function y(x; w); the ‘output’ of the network, 
y, is a nonlinear function of the ‘inputs’ x; this function is parameterized by 
‘weights’ w. 

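We will study a single neuron which produces an output between 0 and 1
as the following function of $\mathbf{x}$:

$$
y(\mathbf{x}; \mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}. \tag{39.7}
$$

> Exercise 39.1.[1] In what contexts have we encountered the function
$y(\mathbf{x}; \mathbf{w}) = 1/(1 + e^{-\mathbf{w} \cdot \mathbf{x}})$ already?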


Motivations for the linear logistic function 


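In section 11.2 we studied ‘the best detection of pulses’, assuming that one
of two signals $\mathbf{x}_0$ and $\mathbf{x}_1$ had been transmitted over a Gaussian channel with
variance-covariance matrix $\mathbf{A}^{-1}$. We found that the probability that the source
signal was $s=1$ rather than $s=0$, given the received signal $\mathbf{y}$, was

$$
P(s=1 \mid \mathbf{y}) = \frac{1}{1 + e^{-a(\mathbf{y})}}, \tag{39.8}
$$

where $a(\mathbf{y})$ was a linear function of the received vector,

$$
a(\mathbf{y}) = \mathbf{w}^{\mathsf{T}} \mathbf{y} + \theta, \tag{39.9}
$$

with $\mathbf{w} = \mathbf{A}(\mathbf{x}_1 - \mathbf{x}_0)$.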
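A small numerical check may make this motivation concrete. The Python sketch below is not from the text: it works in two dimensions with hand-picked numbers, and the expression used for the offset θ (the log prior ratio plus a term quadratic in the two signals) is worked out here from the Gaussian log-likelihood ratio rather than quoted from section 11.2. It compares the posterior computed directly from Bayes' theorem with the logistic function of a linear activation.

```python
import math


def quad(A, u, v):
    """u^T A v for 2-vectors u, v and a 2x2 matrix A."""
    return sum(u[i] * A[i][j] * v[j] for i in range(2) for j in range(2))


def posterior_bayes(y, x0, x1, A, p1=0.5):
    """P(s=1 | y) from Bayes' theorem; A is the inverse covariance of the noise."""
    d0 = [y[i] - x0[i] for i in range(2)]
    d1 = [y[i] - x1[i] for i in range(2)]
    # The Gaussian normalizing constants are equal and cancel in the ratio.
    l0 = (1.0 - p1) * math.exp(-0.5 * quad(A, d0, d0))
    l1 = p1 * math.exp(-0.5 * quad(A, d1, d1))
    return l1 / (l0 + l1)


def posterior_logistic(y, x0, x1, A, p1=0.5):
    """The same posterior written as a logistic function of a(y) = w.y + theta."""
    w = [sum(A[i][j] * (x1[j] - x0[j]) for j in range(2)) for i in range(2)]
    # Offset derived here (an assumption of this sketch, for symmetric A).
    theta = (-0.5 * quad(A, x1, x1) + 0.5 * quad(A, x0, x0)
             + math.log(p1 / (1.0 - p1)))
    a = sum(w[i] * y[i] for i in range(2)) + theta
    return 1.0 / (1.0 + math.exp(-a))


A = [[2.0, 0.5], [0.5, 1.0]]   # symmetric, positive definite
x0, x1 = (0.0, 0.0), (1.0, 2.0)
y = (0.3, -0.7)
print(posterior_bayes(y, x0, x1, A, p1=0.3))
print(posterior_logistic(y, x0, x1, A, p1=0.3))
```

The two printed values agree to rounding error, because the quadratic terms in the received vector cancel when the two Gaussians share a covariance matrix.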


The linear logistic function can be motivated in several other ways — see 
the exercises. 
Figure 39.2. Output of a simple
neural network as a function of its
input.





Input space and weight space 


For convenience let us study the case where the input vector $\mathbf{x}$ and the parameter
vector $\mathbf{w}$ are both two-dimensional: $\mathbf{x} = (x_1, x_2)$, $\mathbf{w} = (w_1, w_2)$. Then we
can spell out the function performed by the neuron thus:

$$
y(\mathbf{x}; \mathbf{w}) = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2)}}. \tag{39.10}
$$

Figure 39.2 shows the output of the neuron as a function of the input vector,
for $\mathbf{w} = (0, 2)$. The two horizontal axes of this figure are the inputs $x_1$ and $x_2$,
with the output $y$ on the vertical axis. Notice that on any line perpendicular
to $\mathbf{w}$, the output is constant; and along a line in the direction of $\mathbf{w}$, the output
is a sigmoid function.
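The two observations about figure 39.2 can be checked numerically. In this Python sketch the function name `neuron` and the sample points are illustrative choices; only the weight vector w = (0, 2) comes from the text. It evaluates the output at points along a line perpendicular to w and at points along the direction of w.

```python
import math


def neuron(x, w):
    """Output of the two-input neuron with no bias: logistic function of w.x."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-a))


w = (0.0, 2.0)
perp = (1.0, 0.0)      # a direction perpendicular to w
start = (0.5, 0.25)    # arbitrary starting point

# Moving perpendicular to w leaves the activation, and so the output, unchanged.
along_perp = [neuron((start[0] + s * perp[0], start[1] + s * perp[1]), w)
              for s in (-2.0, 0.0, 3.0)]

# Moving along w sweeps the activation through the sigmoid.
along_w = [neuron((s * w[0], s * w[1]), w) for s in (-1.0, 0.0, 1.0)]

print(along_perp)
print(along_w)
```

Scaling w by a constant leaves the directions of constant output unchanged but makes the sigmoid along w steeper.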

We now introduce the idea of weight space, that is, the parameter space of 
the network. In this case, there are two parameters $w_1$ and $w_2$, so the weight
space is two dimensional. This weight space is shown in figure 39.3. For a 
selection of values of the parameter vector w, smaller inset figures show the 
function of x performed by the network when w is set to those values. Each of 
these smaller figures is equivalent to figure 39.2. Thus each point in w space 
corresponds to a function of x. Notice that the gain of the sigmoid function 
(the gradient of the ramp) increases as the magnitude of w increases. 

Now, the central idea of supervised neural networks is this. Given examples 
of a relationship between an input vector x, and a target t, we hope to make 
the neural network ‘learn’ a model of the relationship between x and t. A 
successfully trained network will, for any given x, give an output y that is 
close (in some sense) to the target value t. Training the network involves 
searching in the weight space of the network for a value of w that produces a 
function that fits the provided training data well. 

Typically an objective function or error function is defined, as a function 
of w, to measure how well the network with weights set to w solves the task. 
The objective function is a sum of terms, one for each input/target pair {x,t},
measuring how close the output y(x; w) is to the target t. The training process 
is an exercise in function minimization — i.e., adjusting w in such a way as to 
find a w that minimizes the objective function. Many function-minimization 
algorithms make use not only of the objective function, but also its gradient 
with respect to the parameters w. For general feedforward neural networks 
the backpropagation algorithm efficiently evaluates the gradient of the output 
y with respect to the parameters w, and thence the gradient of the objective 
function with respect to w. 
Figure 39.3. Weight space.

> 39.3 Training the single neuron as a binary classifier


We assume we have a data set of inputs $\{\mathbf{x}^{(n)}\}_{n=1}^{N}$ with binary labels $\{t^{(n)}\}_{n=1}^{N}$,
and a neuron whose output $y(\mathbf{x}; \mathbf{w})$ is bounded between 0 and 1. We can then
write down the following error function:

$$
G(\mathbf{w}) = -\sum_n \left[ t^{(n)} \ln y(\mathbf{x}^{(n)}; \mathbf{w}) + (1 - t^{(n)}) \ln\left(1 - y(\mathbf{x}^{(n)}; \mathbf{w})\right) \right]. \tag{39.11}
$$

Each term in this objective function may be recognized as the information
content of one outcome. It may also be described as the relative entropy
between the empirical probability distribution $(t^{(n)}, 1 - t^{(n)})$ and the probability
distribution implied by the output of the neuron $(y, 1-y)$. The objective
function is bounded below by zero and only attains this value if $y(\mathbf{x}^{(n)}; \mathbf{w}) = t^{(n)}$
for all $n$.

We now differentiate this objective function with respect to w. 


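Exercise 39.2.[2] The backpropagation algorithm. Show that the derivative
$\mathbf{g} = \partial G / \partial \mathbf{w}$ is given by:

$$
g_j = \frac{\partial G}{\partial w_j} = -\sum_{n=1}^{N} \left( t^{(n)} - y^{(n)} \right) x_j^{(n)}. \tag{39.12}
$$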


Notice that the quantity $e^{(n)} \equiv t^{(n)} - y^{(n)}$ is the error on example $n$: the
difference between the target and the output. The simplest thing to do with
a gradient of an error function is to descend it (even though this is often
dimensionally incorrect, since a gradient has dimensions [1/parameter], whereas
a change in a parameter has dimensions [parameter]). Since the derivative
$\partial G / \partial \mathbf{w}$ is a sum of terms $\mathbf{g}^{(n)}$ defined by

$$
g_j^{(n)} \equiv -\left( t^{(n)} - y^{(n)} \right) x_j^{(n)} \tag{39.13}
$$

for $n = 1, \ldots, N$, we can obtain a simple on-line algorithm by putting each
input through the network one at a time, and adjusting $\mathbf{w}$ a little in a direction
opposite to $\mathbf{g}^{(n)}$.
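Before using a gradient formula in an optimizer it is worth checking it numerically. The Python sketch below uses a made-up three-example data set and invented function names; it evaluates the error function G(w), evaluates the gradient as a sum of per-example terms of the form -(t - y) x_j, and compares the result with central finite differences of G. It is a check of the formula, not a derivation of it.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def output(x, w):
    """y(x; w) for a single neuron; a leading input of 1 plays the role of a bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def error_G(w, X, T):
    """G(w): minus the sum over examples of t ln y + (1 - t) ln(1 - y)."""
    total = 0.0
    for x, t in zip(X, T):
        y = output(x, w)
        total -= t * math.log(y) + (1 - t) * math.log(1.0 - y)
    return total


def gradient_G(w, X, T):
    """Gradient of G: the sum over examples of -(t - y) x_j."""
    g = [0.0] * len(w)
    for x, t in zip(X, T):
        e = t - output(x, w)
        for j, xj in enumerate(x):
            g[j] -= e * xj
    return g


def numerical_gradient(w, X, T, h=1e-6):
    """Central finite-difference estimate of the gradient of G."""
    g = []
    for j in range(len(w)):
        up, down = list(w), list(w)
        up[j] += h
        down[j] -= h
        g.append((error_G(up, X, T) - error_G(down, X, T)) / (2.0 * h))
    return g


# Hypothetical data: first component fixed to 1 so that w[0] acts as a bias.
X = [(1.0, 2.0, 1.0), (1.0, -1.0, 0.5), (1.0, 0.0, -2.0)]
T = [1, 0, 1]
w = [0.1, -0.3, 0.2]

print(gradient_G(w, X, T))
print(numerical_gradient(w, X, T))
```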

We summarize the whole learning algorithm. 


The on-line gradient-descent learning algorithm 


Architecture. A single neuron has a number $I$ of inputs $x_i$ and one output
$y$. Associated with each input is a weight $w_i$ ($i = 1, \ldots, I$).


Activity rule. 1. First, in response to the received inputs $\mathbf{x}$ (which may be
arbitrary real numbers), we compute the activation of the neuron,

$$
a = \sum_i w_i x_i, \tag{39.14}
$$

where the sum is over $i = 0, \ldots, I$ if there is a bias and $i = 1, \ldots, I$
otherwise.


2. Second, the output $y$ is set as a sigmoid function of the activation.

$$
y(a) = \frac{1}{1 + e^{-a}}. \tag{39.15}
$$

This output might be viewed as stating the probability, according to the
neuron, that the given input is in class 1 rather than class 0.
Learning rule. The teacher supplies a target value $t \in \{0, 1\}$ which says
what the correct answer is for the given input. We compute the error
signal

$$
e = t - y \tag{39.16}
$$

then adjust the weights $\mathbf{w}$ in a direction that would reduce the magnitude
of this error:

$$
\Delta w_i = \eta e x_i, \tag{39.17}
$$

where $\eta$ is the ‘learning rate’. Commonly $\eta$ is set by trial and error to a
constant value or to a decreasing function of simulation time $\tau$ such as
$\eta_0 / \tau$.


The activity rule and learning rule are repeated for each input/target pair 
(x,t) that is presented. If there is a fixed data set of size N, we can cycle 
through the data multiple times. 
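The on-line algorithm summarized above can be sketched in a few lines of Python. Everything specific in this example is an assumption made for illustration: the function name `train_online`, the constant learning rate of 0.5, the 500 passes through the data, the random presentation order, and the toy data set (the logical AND of two binary inputs, with a leading input fixed to 1 to carry the bias).

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_online(X, T, eta=0.5, epochs=500, seed=0):
    """On-line learning: update the weights after every example, w_i += eta * e * x_i."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                 # present the examples in random order
        for n in order:
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, X[n])))  # activity rule
            e = T[n] - y                                          # error signal
            w = [wi + eta * e * xi for wi, xi in zip(w, X[n])]    # learning rule
    return w


# Toy data set: x[0] = 1 is the bias input; the target is the AND of x[1] and x[2].
X = [(1.0, 0.0, 0.0), (1.0, 0.0, 1.0), (1.0, 1.0, 0.0), (1.0, 1.0, 1.0)]
T = [0, 0, 0, 1]

w = train_online(X, T)
predictions = [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5) for x in X]
print(w)
print(predictions)
```

Because this toy data set is linearly separable, the weights keep growing for as long as training continues.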


Batch learning versus on-line learning 


Here we have described the on-line learning algorithm, in which a change in 
the weights is made after every example is presented. An alternative paradigm 
is to go through a batch of examples, computing the outputs and errors and 
accumulating the changes specified in equation (39.17) which are then made 
at the end of the batch. 


Batch learning for the single neuron classifier 


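For each input/target pair $(\mathbf{x}^{(n)}, t^{(n)})$ ($n = 1, \ldots, N$), compute
$y^{(n)} = y(\mathbf{x}^{(n)}; \mathbf{w})$, where

$$
y(\mathbf{x}; \mathbf{w}) = \frac{1}{1 + \exp\left(-\sum_i w_i x_i\right)}, \tag{39.18}
$$

define $e^{(n)} = t^{(n)} - y^{(n)}$, and compute for each weight $w_i$

$$
g_i^{(n)} = -e^{(n)} x_i^{(n)}. \tag{39.19}
$$

Then let

$$
\Delta w_i = -\eta \sum_n g_i^{(n)}. \tag{39.20}
$$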





This batch learning algorithm is a gradient descent algorithm, whereas the
on-line algorithm is a stochastic gradient descent algorithm. Source code
implementing batch learning is given in algorithm 39.5. This algorithm is
demonstrated in figure 39.4 for a neuron with two inputs with weights $w_1$ and
$w_2$ and a bias $w_0$, performing the function

$$
y(\mathbf{x}; \mathbf{w}) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}}. \tag{39.21}
$$

The bias $w_0$ is included, in contrast to figure 39.3, where it was omitted. The
neuron is trained on a data set of ten labelled examples.







[Figure 39.4: plots not reproduced. Axes recoverable from the residue: inputs x1 and x2 (0 to 10); weights w1 and w2; G(w); iteration counts on a log scale from 1 to 100000.]

Figure 39.4. A single neuron learning to classify by gradient descent. The neuron has two weights $w_1$
and $w_2$ and a bias $w_0$. The learning rate was set to $\eta = 0.01$ and batch-mode gradient
descent was performed using the code displayed in algorithm 39.5. (a) The training data.
(b) Evolution of weights $w_0$, $w_1$ and $w_2$ as a function of number of iterations (on log scale).
(c) Evolution of weights $w_1$ and $w_2$ in weight space. (d) The objective function $G(\mathbf{w})$ as a
function of number of iterations. (e) The magnitude of the weights $E_W(\mathbf{w})$ as a function of
time. (f-k) The function performed by the neuron (shown by three of its contours) after 30,
80, 500, 3000, 10000 and 40000 iterations. The contours shown are those corresponding to
$a = 0, \pm 1$, namely $y = 0.5$, 0.27 and 0.73. Also shown is a vector proportional to $(w_1, w_2)$.
The larger the weights are, the bigger this vector becomes, and the closer together are the
contours.
global x ;   # x is an N * I matrix containing all the input vectors
global t ;   # t is a vector of length N containing all the targets

for l = 1:L                           # loop L times
  a = x * w ;                         # compute all activations
  y = sigmoid(a) ;                    # compute outputs
  e = t - y ;                         # compute errors
  g = - x' * e ;                      # compute the gradient vector
  w = w - eta * ( g + alpha * w ) ;   # make step, using learning rate eta
                                      # and weight decay alpha
endfor

function f = sigmoid ( v )
  f = 1.0 ./ ( 1.0 + exp ( - v ) ) ;
endfunction
Algorithm 39.5. Octave source code for a gradient descent optimizer of a single
neuron, batch learning, with optional weight decay (rate alpha).
Octave notation: the instruction a = x * w causes the (N × I)
matrix x consisting of all the input vectors to be multiplied by
the weight vector w, giving the vector a listing the activations for
all N input vectors; x' means x-transpose; the single command
y = sigmoid(a) computes the sigmoid function of all elements of
the vector a.


Figure 39.6. The influence of weight decay on a single neuron’s
learning. The objective function is $M(\mathbf{w}) = G(\mathbf{w}) + \alpha E_W(\mathbf{w})$. The
learning method was as in figure 39.4. (a) Evolution of
weights $w_0$, $w_1$ and $w_2$. (b) Evolution of weights $w_1$ and $w_2$ in
weight space shown by points, contrasted with the trajectory
followed in the case of zero weight decay, shown by a thin line (from
figure 39.4). Notice that for this problem weight decay has an
effect very similar to ‘early stopping’. (c) The objective
function $M(\mathbf{w})$ and the error function $G(\mathbf{w})$ as a function of
number of iterations. (d) The function performed by the neuron
after 40000 iterations.
> 39.4 Beyond descent on the error function: regularization


If the parameter $\eta$ is set to an appropriate value, this algorithm works: the
algorithm finds a setting of w that correctly classifies as many of the examples 
as possible. 

If the examples are in fact linearly separable then the neuron finds this lin- 
ear separation and its weights diverge to ever-larger values as the simulation 
continues. This can be seen happening in figure 39.4(f-k). This is an exam- 
ple of overfitting, where a model fits the data so well that its generalization 
performance is likely to be adversely affected. 

This behaviour may be viewed as undesirable. How can it be rectified? 

An ad hoc solution to overfitting is to use early stopping, that is, use 
an algorithm originally intended to minimize the error function G(w), then 
prevent it from doing so by halting the algorithm at some point. 

A more principled solution to overfitting makes use of regularization. Reg- 
ularization involves modifying the objective function in such a way as to in- 
corporate a bias against the sorts of solution w which we dislike. In the above 
example, what we dislike is the development of a very sharp decision bound- 
ary in figure 39.4k; this sharp boundary is associated with large weight values, 
so we use a regularizer that penalizes large weight values. We modify the 
objective function to: 


$$
M(\mathbf{w}) = G(\mathbf{w}) + \alpha E_W(\mathbf{w}) \tag{39.22}
$$

where the simplest choice of regularizer is the weight decay regularizer

$$
E_W(\mathbf{w}) = \frac{1}{2} \sum_i w_i^2. \tag{39.23}
$$

The regularization constant $\alpha$ is called the weight decay rate. This additional
term favours small values of $\mathbf{w}$ and decreases the tendency of a model to overfit
fine details of the training data. The quantity $\alpha$ is known as a hyperparameter.
Hyperparameters play a role in the learning algorithm but play no role in the
activity rule of the network.


> Exercise 39.3.[1] Compute the derivative of $M(\mathbf{w})$ with respect to $w_i$. Why is
the above regularizer known as the ‘weight decay’ regularizer?


The gradient descent source code of algorithm 39.5 implements weight decay. 
This gradient descent algorithm is demonstrated in figure 39.6 using weight 
decay rates $\alpha = 0.01$, 0.1, and 1. As the weight decay rate is increased
the solution becomes biased towards broader sigmoid functions with decision 
boundaries that are closer to the origin. 
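The effect of weight decay can be seen on a very small problem. The Python sketch below follows the structure of the batch update in algorithm 39.5, with step w ← w − η (g + α w), but it is a re-expression for illustration and not the code used for the figures: the function names, the four-point linearly separable data set, the 5000 iterations and the decay rate 0.1 are all choices made here. As in the Octave listing, the decay is applied to every weight, including the bias.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_batch(X, T, eta=0.01, alpha=0.0, iterations=5000):
    """Batch gradient descent on M(w) = G(w) + alpha * E_W(w), starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        g = [0.0] * len(w)
        for x, t in zip(X, T):
            e = t - sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # error on this example
            for i, xi in enumerate(x):
                g[i] -= e * xi                                     # accumulate gradient of G
        w = [wi - eta * (gi + alpha * wi) for wi, gi in zip(w, g)]  # step with weight decay
    return w


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


def classify(w, X):
    return [int(sigmoid(sum(wi * xi for wi, xi in zip(w, x))) > 0.5) for x in X]


# Hypothetical separable data; x[0] = 1 carries the bias.
X = [(1.0, -1.0, -2.0), (1.0, -2.0, -1.0), (1.0, 1.0, 2.0), (1.0, 2.0, 1.0)]
T = [0, 0, 1, 1]

w_plain = train_batch(X, T, alpha=0.0)
w_decay = train_batch(X, T, alpha=0.1)
print(norm(w_plain), classify(w_plain, X))
print(norm(w_decay), classify(w_decay, X))
```

Both runs classify the four points correctly, but the weights trained with decay settle at a smaller magnitude, whereas without decay they continue to grow with the number of iterations.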


Note 


Gradient descent with a step size $\eta$ is in general not the most efficient way to
minimize a function. A modification of gradient descent known as momentum,
while improving convergence, is also not recommended. Most neural network
experts use more advanced optimizers such as conjugate gradient algorithms.
[Please do not confuse momentum, which is sometimes given the symbol $\alpha$,
with weight decay.]
> 39.5 Further exercises

More motivations for the linear neuron


> Exercise 39.4.[2] Consider the task of recognizing which of two Gaussian
distributions a vector $\mathbf{z}$ comes from. Unlike the case studied in section 11.2,
where the distributions had different means but a common variance-covariance
matrix, we will assume that the two distributions have exactly
the same mean but different variances. Let the probability of $\mathbf{z}$
given $s$ ($s \in \{0, 1\}$) be

$$
P(\mathbf{z} \mid s) = \prod_{i=1}^{I} \text{Normal}(z_i; 0, \sigma_{si}^2), \tag{39.24}
$$

where $\sigma_{si}^2$ is the variance of $z_i$ when the source symbol is $s$. Show that
$P(s=1 \mid \mathbf{z})$ can be written in the form

$$
P(s=1 \mid \mathbf{z}) = \frac{1}{1 + \exp(-\mathbf{w}^{\mathsf{T}} \mathbf{x} + \theta)}, \tag{39.25}
$$

where $x_i$ is an appropriate function of $z_i$, $x_i = g(z_i)$.


Exercise 39.5.[2] The noisy LED.

[Figure: a seven-element LED display with its elements numbered 1 to 7, together with example character patterns such as c(2) and c(3); not reproduced.]

Consider an LED display with 7 elements numbered as shown above. The
state of the display is a vector $\mathbf{x}$. When the controller wants the display
to show character number $s$, e.g. $s=2$, each element $x_j$ ($j = 1, \ldots, 7$)
either adopts its intended state $c_j(s)$, with probability $1-f$, or is flipped,
with probability $f$. Let’s call the two states of $x$ ‘+1’ and ‘−1’.

(a) Assuming that the intended character $s$ is actually a 2 or a 3, what
is the probability of $s$, given the state $\mathbf{x}$? Show that $P(s=2 \mid \mathbf{x})$
can be written in the form

$$
P(s=2 \mid \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^{\mathsf{T}} \mathbf{x} + \theta)}, \tag{39.26}
$$

and compute the values of the weights $\mathbf{w}$ in the case $f = 0.1$.

(b) Assuming that $s$ is one of $\{0, 1, 2, \ldots, 9\}$, with prior probabilities
$p_s$, what is the probability of $s$, given the state $\mathbf{x}$? Put your answer
in the form

$$
P(s \mid \mathbf{x}) = \frac{e^{a_s}}{\sum_{s'} e^{a_{s'}}}, \tag{39.27}
$$

where $\{a_s\}$ are functions of $\{c_j(s)\}$ and $\mathbf{x}$.

Could you make a better alphabet of 10 characters for a noisy LED, i.e.,
an alphabet less susceptible to confusion?


> Exercise 39.6.[2] A (3, 1) error-correcting code consists of the two codewords 
x^{(1)} = (1,0,0) and x^{(2)} = (0,0,1). A source bit s ∈ {1,2} having 
probability distribution {p_1, p_2} is used to select one of the two codewords for 
transmission over a binary symmetric channel with noise level f. The 
received vector is r. Show that the posterior probability of s given r can 
be written in the form 

    P(s=1 | r) = \frac{1}{1 + \exp\left( -w_0 - \sum_{n=1}^{3} w_n r_n \right)},

and give expressions for the coefficients {w_n} and the bias, w_0. 

Describe, with a diagram, how this optimal decoder can be expressed in 
terms of a ‘neuron’. 

Table 39.7. An alternative 15-character alphabet for the 7-element LED 
display. [The character glyphs, numbered 0 to 14, are not recoverable here.] 

Problems to look at before Chapter 40 


> Exercise 40.1.[2] What is \sum_{K=0}^{N} \binom{N}{K}? 

[The symbol \binom{N}{K} means the combination \frac{N!}{K!(N-K)!}.] 


> Exercise 40.2.[2] If the top row of Pascal’s triangle (which contains the single 
number ‘1’) is denoted row zero, what is the sum of all the numbers in 
the triangle above row N? 

> Exercise 40.3.[2] 3 points are selected at random on the surface of a sphere. 
What is the probability that all of them lie on a single hemisphere? 


This chapter’s material is originally due to Polya (1954) and Cover (1965) and 
the exposition that follows is Yaser Abu-Mostafa’s. 

Capacity of a Single Neuron 


Figure 40.1. Neural network learning viewed as communication. [Diagram: the 
targets {t_n}_{n=1}^{N} and inputs {x_n}_{n=1}^{N} enter a learning algorithm, 
which produces weights w; the receiver combines w with the inputs {x_n} to 
obtain estimates of the targets.] 


> 40.1 Neural network learning as communication 


Many neural network models involve the adaptation of a set of weights w in 
response to a set of data points, for example a set of N target values D_N = 
{t_n}_{n=1}^{N} at given locations {x_n}_{n=1}^{N}. The adapted weights are then used to 
process subsequent input data. This process can be viewed as a communication 
process, in which the sender examines the data D_N and creates a message w 
that depends on those data. The receiver then uses w; for example, the 
receiver might use the weights to try to reconstruct what the data D_N was. 
[In neural network parlance, this is using the neuron for ‘memory’ rather than 
for ‘generalization’; ‘generalizing’ means extrapolating from the observed data 
to the value of t_{N+1} at some new location x_{N+1}.] Just as a disk drive is a 
communication channel, the adapted network weights w therefore play the 
role of a communication channel, conveying information about the training 
data to a future user of that neural net. The question we now address is, 
‘what is the capacity of this channel?’ — that is, ‘how much information can 
be stored by training a neural network?’ 

If we had a learning algorithm that either produces a network whose re- 
sponse to all inputs is +1 or a network whose response to all inputs is 0, 
depending on the training data, then the weights allow us to distinguish be- 
tween just two sorts of data set. The maximum information such a learning 
algorithm could convey about the data is therefore 1 bit, this information con- 
tent being achieved if the two sorts of data set are equiprobable. How much 
more information can be conveyed if we make full use of a neural network’s 
ability to represent other functions? 


»> 40.2 The capacity of a single neuron 


We will look at the simplest case, that of a single binary threshold neuron. We 
will find that the capacity of such a neuron is two bits per weight. A neuron 
with K inputs can store 2K bits of information. 

To obtain this interesting result we lay down some rules to exclude less 
interesting answers, such as: ‘the capacity of a neuron is infinite, because each 
of its weights is a real number and so can convey an infinite number of bits’. 
We exclude this answer by saying that the receiver is not able to examine the 
weights directly, nor is the receiver allowed to probe the weights by observing 
the output of the neuron for arbitrarily chosen inputs. We constrain the 
receiver to observe the output of the neuron at the same fixed set of N points 
{x_n} that were in the training set. What matters now is how many different 
distinguishable functions our neuron can produce, given that we can observe 
the function only at these N points. How many different binary labellings of 
N points can a linear threshold function produce? And how does this number 
compare with the maximum possible number of binary labellings, 2^N? If 
nearly all of the 2^N labellings can be realized by our neuron, then it is a 
communication channel that can convey all N bits (the target values {t_n}) 
with small probability of error. We will identify the capacity of the neuron as 
the maximum value that N can have such that the probability of error is very 
small. [We are departing a little from the definition of capacity in Chapter 9.] 

We thus examine the following scenario. The sender is given a neuron 
with K inputs and a data set D_N which is a labelling of N points. The 
sender uses an adaptive algorithm to try to find a w that can reproduce this 
labelling exactly. We will assume the algorithm finds such a w if it exists. The 
receiver then evaluates the threshold function on the N input values. What 
is the probability that all N bits are correctly reproduced? How large can N 
become, for a given K, without this probability becoming substantially less 
than one? 


General position 


One technical detail needs to be pinned down: what set of inputs {x_n} are we 
considering? Our answer might depend on this choice. We will assume that 
the points are in general position. 


Definition 40.1 A set of points {x_n} in K-dimensional space are in general 
position if any subset of size ≤ K is linearly independent, and no K+1 of 
them lie in a (K−1)-dimensional plane. 


In K = 3 dimensions, for example, a set of points are in general position if no 
three points are colinear and no four points are coplanar. The intuitive idea is 
that points in general position are like random points in the space, in terms of 
the linear dependences between points. You don’t expect three random points 
in three dimensions to lie on a straight line. 


The linear threshold function 


The neuron we will consider performs the function 

    y = f\left( \sum_{k=1}^{K} w_k x_k \right),    (40.1)

where 

    f(a) = \begin{cases} 1 & a > 0 \\ 0 & a \le 0. \end{cases}    (40.2)

We will not have a bias w_0; the capacity for a neuron with a bias can be 
obtained by replacing K by K + 1 in the final result below, i.e., considering 
one of the inputs to be fixed to 1. (These input points would not then be in 
general position; the derivation still works.) 

Figure 40.2. One data point in a two-dimensional input space, and the two 
regions of weight space that give the two alternative labellings of that point. 





> 40.3 Counting threshold functions 


Let us denote by T(N, K) the number of distinct threshold functions on N 
points in general position in K dimensions. We will derive a formula for 
T(N, K). 

To start with, let us work out a few cases by hand. 


In K = 1 dimension, for any N 


The N points lie on a line. By changing the sign of the one weight w_1 we can 
label all points on the right side of the origin 1 and the others 0, or vice versa. 
Thus there are two distinct threshold functions. T(N,1) = 2. 


With N = 1 point, for any K 


If there is just one point x^{(1)} then we can realize both possible labellings by 
setting w = ±x^{(1)}. Thus T(1, K) = 2. 





In K =2 dimensions 


In two dimensions with N points, we are free to spin the separating line around 
the origin. Each time the line passes over a point we obtain a new function. 
Once we have spun the line through 360 degrees we reproduce the function 
we started from. Because the points are in general position, the separating 
plane (line) crosses only one point at a time. In one revolution, every point 
is passed over twice. There are therefore 2N distinct threshold functions. 
T(N,2) =2N. 

Comparing with the total number of binary functions, 2^N, we may note 
that for N ≥ 3, not all binary functions can be realized by a linear threshold 
function. One famous example of an unrealizable function with N = 4 and 
K = 2 is the exclusive-or function on the points x = (±1, ±1). [These points 
are not in general position, but you may confirm that the function remains 
unrealizable even if the points are perturbed into general position.] 
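
The spinning-line argument can be checked numerically. The sketch below is an illustration only, not part of the original derivation: it sweeps a unit weight vector once around the origin and collects the distinct labellings it produces. The five points, the angular resolution, and the name `realizable_labellings` are arbitrary choices made here; the points were picked so that no two of them are colinear with the origin.

```python
import math

def realizable_labellings(points, n_angles=36000):
    """Collect the labellings of `points` produced by y = 1 if w.x > 0 else 0
    as the unit weight vector w is swept once around the origin."""
    found = set()
    for i in range(n_angles):
        theta = 2.0 * math.pi * (i + 0.5) / n_angles
        w1, w2 = math.cos(theta), math.sin(theta)
        found.add(tuple(1 if w1 * x1 + w2 * x2 > 0 else 0 for x1, x2 in points))
    return found

# Five illustrative points in general position (no two colinear with the origin).
points = [(1.0, 0.2), (0.3, 1.0), (-0.7, 0.6), (-0.9, -0.4), (0.5, -1.1)]
counts = [len(realizable_labellings(points[:n])) for n in range(1, 6)]
print(counts)  # compare with T(N,2) = 2N for N = 1..5

# Exclusive-or on the points (+-1, +-1): label 1 when the two coordinates differ.
xor_points = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
xor_labels = (0, 1, 1, 0)
print(xor_labels in realizable_labellings(xor_points))
```

A finite sweep can only miss regions narrower than the angular step, so the step is chosen much smaller than the angular gaps between these particular points.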














In K = 2 dimensions, from the point of view of weight space 


There is another way of visualizing this problem. Instead of visualizing a 
plane separating points in the two-dimensional input space, we can consider 
the two-dimensional weight space, colouring regions in weight space different 
colours if they label the given datapoints differently. We can then count the 
number of threshold functions by counting how many distinguishable regions 
there are in weight space. Consider first the set of weight vectors in weight 
space that classify a particular example x^{(n)} as a 1. For example, figure 40.2a 
shows a single point in our two-dimensional x-space, and figure 40.2b shows 
the two corresponding sets of points in w-space. One set of weight vectors 
occupy the half space 

    x^{(n)} \cdot w > 0,    (40.3)

and the others occupy x^{(n)}·w < 0. In figure 40.3a we have added a second 
point in the input space. There are now 4 possible labellings: (1,1), (1,0), 
(0,1), and (0,0). Figure 40.3b shows the two hyperplanes x^{(1)}·w = 0 and 
x^{(2)}·w = 0 which separate the sets of weight vectors that produce each of 
these labellings. When N = 3 (figure 40.4), weight space is divided by three 
hyperplanes into six regions. Not all of the eight conceivable labellings can be 
realized. Thus T(3, 2) = 6. 

Figure 40.3. Two data points in a two-dimensional input space, and the four 
regions of weight space that give the four alternative labellings. 

Figure 40.4. Three data points in a two-dimensional input space, and the six 
regions of weight space that give alternative labellings of those points. In this 
case, the labellings (0,0,0) and (1,1,1) cannot be realized. For any three points 
in general position there are always two labellings that cannot be realized. 


In K =3 dimensions 


We now use this weight space visualization to study the three dimensional 
case. 

Let us imagine adding one point at a time and count the number of threshold 
functions as we do so. When N = 2, weight space is divided by two 
hyperplanes x^{(1)}·w = 0 and x^{(2)}·w = 0 into four regions; in any one region all 
vectors w produce the same function on the 2 input vectors. Thus T(2,3) = 4. 

Adding a third point in general position produces a third plane in w space, 
so that there are 8 distinguishable regions. T(3,3) = 8. The three bisecting 
planes are shown in figure 40.5a. 

At this point matters become slightly more tricky. As figure 40.5b illus- 
trates, the fourth plane in the three-dimensional w space cannot transect all 
eight of the sets created by the first three planes. Six of the existing regions 
are cut in two and the remaining two are unaffected. So T(4,3) = 14. Two 
of the binary functions on 4 points in 3 dimensions cannot be realized by a 
linear threshold function. 

Figure 40.5. Weight space illustrations for T(3,3) and T(4,3). (a) T(3,3) = 8. 
Three hyperplanes (corresponding to three points in general position) divide 
3-space into 8 regions, shown here by colouring the relevant part of the surface 
of a hollow, semi-transparent cube centred on the origin. (b) T(4,3) = 14. Four 
hyperplanes divide 3-space into 14 regions, of which this figure shows 13 (the 
14th region is out of view on the right-hand face). Compare with figure 40.5a: 
all of the regions that are not coloured white have been cut into two. 

Table 40.6. Values of T(N, K) deduced by hand. 

         K
    N    1   2   3   4   5   6   7   8
    1    2   2   2   2   2   2   2   2
    2    2   4   4
    3    2   6   8
    4    2   8  14
    5    2  10
    6    2  12

Figure 40.7. Illustration of the cutting process going from T(3,3) to T(4,3). 
(a) The eight regions of figure 40.5a with one added hyperplane. All of the 
regions that are not coloured white have been cut into two. (b) Here, the 
hollow cube has been made solid, so we can see which regions are cut by the 
fourth plane. The front half of the cube has been cut away. (c) This figure 
shows the new two dimensional hyperplane, which is divided into six regions 
by the three one-dimensional hyperplanes (lines) which cross it. Each of these 
regions corresponds to one of the three-dimensional regions in figure 40.7a 
which is cut into two by this new hyperplane. This shows that 
T(4,3) − T(3,3) = 6. Figure 40.7c should be compared with figure 40.4b. 

We have now filled in the values of T(N, K) shown in table 40.6. Can we 
obtain any insights into our derivation of T(4,3) in order to fill in the rest of 
the table for T(N, K)? Why was T(4,3) greater than T(3,3) by six? 

Six is the number of regions that the new hyperplane bisected in w-space 
(figure 40.7ab). Equivalently, if we look in the K−1 dimensional subspace 
that is the Nth hyperplane, that subspace is divided into six regions by the 
N−1 previous hyperplanes (figure 40.7c). Now this is a concept we have met 
before. Compare figure 40.7c with figure 40.4b. How many regions are created 
by N−1 hyperplanes in a K−1 dimensional space? Why, T(N−1, K−1), of 
course! In the present case N = 4, K = 3, we can look up T(3,2) = 6 in the 
previous section. So 


T(4,3) = T(3,3) + T(3,2). (40.4) 


Recurrence relation for any N, K 


Generalizing this picture, we see that when we add an Nth hyperplane in K 
dimensions, it will bisect T(N−1, K−1) of the T(N−1, K) regions that were 
created by the previous N−1 hyperplanes. Therefore, the total number of 
regions obtained after adding the Nth hyperplane is 2T(N−1, K−1) (since 
T(N−1, K−1) out of T(N−1, K) regions are split in two) plus the remaining 
T(N−1, K) − T(N−1, K−1) regions not split by the Nth hyperplane, which 
gives the following equation for T(N, K): 

    T(N, K) = T(N-1, K) + T(N-1, K-1).    (40.5)





Now all that remains is to solve this recurrence relation given the boundary 
conditions T(N,1) = 2 and T(1, K) = 2. 
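
Before solving the recurrence analytically, it can simply be iterated. The following is a minimal sketch, assuming only recurrence (40.5) and the two boundary conditions just stated; the function name `T` mirrors the text, and the memoization via `lru_cache` is a convenience added here.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def T(N, K):
    """Number of threshold functions on N points in general position in K
    dimensions, from recurrence (40.5) with T(N,1) = T(1,K) = 2."""
    if N < 1 or K < 1:
        raise ValueError("N and K must be positive integers")
    if N == 1 or K == 1:
        return 2
    return T(N - 1, K) + T(N - 1, K - 1)

# Print the first few rows of the table of T(N, K), for K = 1..4.
for N in range(1, 7):
    print(N, [T(N, K) for K in range(1, 5)])
```

The printed rows can be compared against the hand-derived entries T(3,2) = 6, T(3,3) = 8 and T(4,3) = 14.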

Does the recurrence relation (40.5) look familiar? Maybe you remember 
building Pascal’s triangle by adding together two adjacent numbers in one row 
to get the number below. The N, K element of Pascal’s triangle is equal to 

    C(N, K) \equiv \binom{N}{K} = \frac{N!}{(N-K)! \, K!}.    (40.6)

Table 40.8. Pascal’s triangle. 

         K
    N    0   1   2   3   4   5   6   7
    0    1
    1    1   1
    2    1   2   1
    3    1   3   3   1
    4    1   4   6   4   1
    5    1   5  10  10   5   1

Combinations \binom{N}{K} satisfy the equation 

    C(N, K) = C(N-1, K-1) + C(N-1, K),  for all N > 0.    (40.7)

[Here we are adopting the convention that \binom{N}{K} ≡ 0 if K > N or K < 0.] 
So \binom{N}{K} satisfies the required recurrence relation (40.5). This doesn’t mean 
T(N, K) = \binom{N}{K}, since many functions can satisfy one recurrence relation. 

Figure 40.9. The fraction of functions on N points in K dimensions that are linear threshold functions, 
T(N, K)/2^N, shown from various viewpoints. In (a) we see the dependence on K, which 
is approximately an error function passing through 0.5 at K = N/2; the fraction reaches 1 
at K = N. In (b) we see the dependence on N, which is 1 up to N = K and drops sharply 
at N = 2K. Panel (c) shows the dependence on N/K for K = 1000. There is a sudden 
drop in the fraction of realizable labellings when N = 2K. Panel (d) shows the values of 
log_2 T(N, K) and log_2 2^N as a function of N for K = 1000. These figures were plotted 
using the approximation of T/2^N by the error function. 


But perhaps we can express T(N, K) as a linear superposition of combination 
functions of the form C_{α,β}(N, K) ≡ \binom{N+α}{K+β}. By comparing tables 40.8 and 
40.6 we can see how to satisfy the boundary conditions: we simply need to 
translate Pascal’s triangle to the right by 1, 2, 3, ...; superpose; add; multiply 
by two, and drop the whole table by one line. Thus: 

    T(N, K) = 2 \sum_{k=0}^{K-1} \binom{N-1}{k}.    (40.8)

Using the fact that the Nth row of Pascal’s triangle sums to 2^N, that is, 
\sum_{k=0}^{N-1} \binom{N-1}{k} = 2^{N-1}, we can simplify the cases where K−1 ≥ N−1. 

    T(N, K) = \begin{cases} 2^N & K \ge N \\ 2 \sum_{k=0}^{K-1} \binom{N-1}{k} & K < N. \end{cases}    (40.9)

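
The closed form is easy to evaluate directly. Here is a small sketch, assuming Python's `math.comb` for the binomial coefficients (it returns zero when k exceeds N−1, which matches the convention adopted above); the name `T_closed` is introduced here for illustration.

```python
from math import comb

def T_closed(N, K):
    """Equation (40.8): T(N,K) = 2 * sum_{k=0}^{K-1} C(N-1, k)."""
    return 2 * sum(comb(N - 1, k) for k in range(K))

print(T_closed(4, 3))                         # 14, the value found by hand
print([T_closed(N, 2) for N in range(1, 7)])  # [2, 4, 6, 8, 10, 12], i.e. 2N
print(T_closed(5, 8) == 2 ** 5)               # True: K >= N gives 2^N
print(T_closed(20, 10) / 2 ** 20)             # 0.5: the fraction at N = 2K
```

The last line reflects a symmetry of the sum: when N = 2K, the K terms cover exactly half of row N−1 of Pascal's triangle, so T(2K, K)/2^{2K} is exactly one half.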
Interpretation 


It is natural to compare T(N, K) with the total number of binary functions on 
N points, 2^N. The ratio T(N, K)/2^N tells us the probability that an arbitrary 
labelling {t_n}_{n=1}^{N} can be memorized by our neuron. The two functions are 
equal for all N ≤ K. The line N = K is thus a special line, defining the 
maximum number of points on which any arbitrary labelling can be realized. 
This number of points is referred to as the Vapnik–Chervonenkis dimension 
(VC dimension) of the class of functions. The VC dimension of a binary 
threshold function on K dimensions is thus K. 

What is interesting is (for large K) the number of points N such that 
almost any labelling can be realized. The ratio T(N, K)/2^N is, for N < 2K, 
still greater than 1/2, and for large K the ratio is very close to 1. 

For our purposes the sum in equation (40.9) is well approximated by the 
error function, 

    \sum_{0}^{K} \binom{N}{k} \simeq 2^N \, \Phi\left( \frac{K - (N/2)}{\sqrt{N}/2} \right),    (40.10)

where Φ(z) ≡ \int_{-\infty}^{z} \exp(-z^2/2) / \sqrt{2\pi}. Figure 40.9 shows the realizable fraction 
T(N, K)/2^N as a function of N and K. The take-home message is shown in 
figure 40.9c: although the fraction T(N, K)/2^N is less than 1 for N > K, it is 
only negligibly less than 1 up to N = 2K; there, there is a catastrophic drop 
to zero, so that for N > 2K, only a tiny fraction of the binary labellings can 
be realized by the threshold function. 
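
The sharpness of the drop at N = 2K can be seen by tabulating the realizable fraction for K = 1000. The sketch below compares the exact fraction from (40.8) with a Gaussian approximation of the form (40.10); applying (40.10) to the sum over k = 0..K−1 of C(N−1, k), without any continuity correction, is a choice made here, and the function names are illustrative.

```python
from math import comb, erf, sqrt

def realizable_fraction(N, K):
    """Exact T(N,K)/2^N using equation (40.8)."""
    return 2 * sum(comb(N - 1, k) for k in range(K)) / 2 ** N

def Phi(z):
    """Cumulative distribution function of a standard Gaussian."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def gaussian_fraction(N, K):
    """Approximation (40.10) applied to the sum over k = 0..K-1 of C(N-1, k)."""
    n, top = N - 1, K - 1
    return Phi((top - n / 2.0) / (sqrt(n) / 2.0))

K = 1000
for ratio in (1.8, 1.9, 2.0, 2.1, 2.2):
    N = int(round(ratio * K))
    print(ratio, realizable_fraction(N, K), gaussian_fraction(N, K))
```

Python's arbitrary-precision integers keep the exact sum free of rounding until the final division.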


Conclusion 


The capacity of a linear threshold neuron, for large K, is 2 bits per weight. 
A single neuron can almost certainly memorize up to N = 2K random 
binary labels perfectly, but will almost certainly fail to memorize more. 


> 40.4 Further exercises 


> Exercise 40.4.[2] Can a finite set of 2N distinct points in a two-dimensional 
space be split in half by a straight line 
• if the points are in general position? 
• if the points are not in general position? 


Can 2N points in a K dimensional space be split in half by a K — 1 
dimensional hyperplane? 


Exercise 40.5.[2] Four points are selected at random on the surface of a 
sphere. What is the probability that all of them lie on a single hemi- 
sphere? How does this question relate to T(N, K)? 


Exercise 40.6.[2] Consider the binary threshold neuron in K = 3 dimensions, 
and the set of points {x} = {(1,0,0), (0,1,0), (0,0,1), (1,1,1)}. Find 
a parameter vector w such that the neuron memorizes the labels: (a) 
{t} = {1,1,1,1}; (b) {t} = {1,1,0,0}. 
Find an unrealizable labelling {t}. 


> Exercise 40.7.[3] In this chapter we constrained all our hyperplanes to go 
through the origin. In this exercise, we remove this constraint. 

How many regions in a plane are created by N lines in general position? 

Figure 40.10. Three lines in a plane create seven regions. 


Exercise 40.8.[2] Estimate in bits the total sensory experience that you have 
had in your life: visual information, auditory information, etc. Estimate 
how much information you have memorized. Estimate the information 
content of the works of Shakespeare. Compare these with the capacity of 
your brain assuming you have 10^{11} neurons each making 1000 synaptic 
connections, and that the capacity result for one neuron (two bits per 
connection) applies. Is your brain full yet? 

> Exercise 40.9.[3] What is the capacity of the axon of a spiking neuron, viewed 
as a communication channel, in bits per second? [See MacKay and 
McCulloch (1952) for an early publication on this topic.] Multiply by 
the number of axons in the optic nerve (about 10^6) or cochlear nerve 
(about 50000 per ear) to estimate again the rate of acquisition of sensory 
experience. 


> 40.5 Solutions 


Solution to exercise 40.5 (p.490). The probability that all four points lie on a 
single hemisphere is 


    T(4,3)/2^4 = 14/16 = 7/8.    (40.11)
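
This value can be checked by simulation. The sketch below is a Monte Carlo illustration written for this purpose, not the method of the text: it draws points uniformly on the sphere and tests whether some plane through the origin and two of the points leaves all the remaining points strictly on one side, which for points in general position is equivalent to all of them lying on a single hemisphere. The function names, trial count and seed are arbitrary choices.

```python
import random

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def random_point_on_sphere(rng):
    """Uniform random direction, from three independent Gaussian coordinates."""
    while True:
        v = (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        norm = dot(v, v) ** 0.5
        if norm > 1e-12:
            return tuple(c / norm for c in v)

def on_single_hemisphere(points):
    """True if a plane through the origin and two of the points has every
    other point strictly on the same side (assumes general position)."""
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            normal = cross(points[i], points[j])
            sides = [dot(normal, points[k]) for k in range(n) if k not in (i, j)]
            if all(s > 0 for s in sides) or all(s < 0 for s in sides):
                return True
    return False

def estimate(n_points, trials=20000, seed=0):
    rng = random.Random(seed)
    hits = sum(
        on_single_hemisphere([random_point_on_sphere(rng) for _ in range(n_points)])
        for _ in range(trials)
    )
    return hits / trials

print(estimate(4), 7 / 8)
```

With three points the test always succeeds, in agreement with T(3,3)/2^3 = 1.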
Learning as Inference 


> 41.1 Neural network learning as inference 


In Chapter 39 we trained a simple neural network as a classifier by minimizing 
an objective function 


    M(w) = G(w) + \alpha E_W(w)    (41.1)

made up of an error function 

    G(w) = -\sum_{n} \left[ t^{(n)} \ln y(x^{(n)}; w) + (1 - t^{(n)}) \ln(1 - y(x^{(n)}; w)) \right]    (41.2)

and a regularizer 

    E_W(w) = \frac{1}{2} \sum_{i} w_i^2.    (41.3)


This neural network learning process can be given the following probabilistic 
interpretation. 

We interpret the output y(x; w) of the neuron literally as defining (when 
its parameters w are specified) the probability that an input x belongs to class 
t= 1, rather than the alternative t = 0. Thus y(x;w) = P(t=1|x,w). Then 
each value of w defines a different hypothesis about the probability of class 1 
relative to class 0 as a function of x. 

We define the observed data D to be the targets {t} — the inputs {x} are 
assumed to be given, and not to be modelled. To infer w given the data, we 
require a likelihood function and a prior probability over w. The likelihood 
function measures how well the parameters w predict the observed data; it is 
the probability assigned to the observed t values by the model with parameters 
set to w. Now the two equations 

    P(t=1 | w, x) = y
    P(t=0 | w, x) = 1 - y    (41.4)

can be rewritten as the single equation 

    P(t | w, x) = y^t (1-y)^{1-t} = \exp[ t \ln y + (1-t) \ln(1-y) ].    (41.5)

So the error function G can be interpreted as minus the log likelihood: 

    P(D | w) = \exp[-G(w)].    (41.6)
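
The identity between the likelihood and exp(−G) can be verified on a toy data set. The following sketch assumes a neuron with two inputs and no bias; the data, the weight vector and the helper names are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def output(x, w):
    """y(x; w) for a neuron without bias."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def G(w, X, t):
    """Error function (41.2): minus the log likelihood."""
    return -sum(
        tn * math.log(output(x, w)) + (1 - tn) * math.log(1 - output(x, w))
        for x, tn in zip(X, t)
    )

def likelihood(w, X, t):
    """Product over data points of P(t | w, x), using (41.4) directly."""
    p = 1.0
    for x, tn in zip(X, t):
        y = output(x, w)
        p *= y if tn == 1 else 1.0 - y
    return p

# Hypothetical data and weights, chosen only to exercise the formulas.
X = [(1.0, 2.0), (2.0, 0.5), (-1.0, -1.5), (-0.5, 1.0)]
t = [1, 1, 0, 0]
w = (0.8, -0.3)
print(likelihood(w, X, t), math.exp(-G(w, X, t)))
```

The two printed numbers agree to rounding error, since the sum of log probabilities is the log of their product.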


Similarly the regularizer can be interpreted in terms of a log prior 
probability distribution over the parameters: 

    P(w | \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W).    (41.7)

If E_W is quadratic as defined above, then the corresponding prior distribution 
is a Gaussian with variance σ_W^2 = 1/α, and 1/Z_W(α) is equal to (α/2π)^{K/2}, 
where K is the number of parameters in the vector w. 

The objective function M(w) then corresponds to the inference of the 
parameters w, given the data: 

    P(w | D, \alpha) = \frac{P(D | w) P(w | \alpha)}{P(D | \alpha)}    (41.8)
                     = \frac{e^{-G(w)} \, e^{-\alpha E_W(w)} / Z_W(\alpha)}{P(D | \alpha)}    (41.9)
                     = \frac{1}{Z_M} \exp(-M(w)).    (41.10)

So the w found by (locally) minimizing M(w) can be interpreted as the (locally) 
most probable parameter vector, w*. From now on we will refer to w* as w_MP. 

Why is it natural to interpret the error functions as log probabilities? Error 
functions are usually additive. For example, G is a sum of information con- 
tents, and Ew is a sum of squared weights. Probabilities, on the other hand, 
are multiplicative: for independent events x and y, the joint probability is 
P(x,y) = P(x)P(y). The logarithmic mapping maintains this correspondence. 

The interpretation of M(w) as a log probability has numerous benefits, 
some of which we will discuss in a moment. 


> 41.2 Illustration for a neuron with two weights 


In the case of a neuron with just two inputs and no bias, 

    y(x; w) = \frac{1}{1 + e^{-(w_1 x_1 + w_2 x_2)}},    (41.11)

we can plot the posterior probability of w, P(w | D,α) ∝ exp(−M(w)). Imagine 
that we receive some data as shown in the left column of figure 41.1. Each 
data point consists of a two-dimensional input vector x and a t value indicated 
by × (t = 1) or O (t = 0). The likelihood function exp(−G(w)) is shown as a 
function of w in the second column. It is a product of functions of the form 
(41.11). 

The product of traditional learning is a point in w-space, the estimator w*, 
which maximizes the posterior probability density. In contrast, in the Bayesian 
view, the product of learning is an ensemble of plausible parameter values 
(bottom right of figure 41.1). We do not choose one particular hypothesis w; 
rather we evaluate their posterior probabilities. The posterior distribution is 
obtained by multiplying the likelihood by a prior distribution over w space 
(shown as a broad Gaussian at the upper right of figure 41.1). The posterior 
ensemble (within a multiplicative constant) is shown in the third column of 
figure 41.1, and as a contour plot in the fourth column. As the amount of data 
increases (from top to bottom), the posterior ensemble becomes increasingly 
concentrated around the most probable value w*. 





> 41.3 Beyond optimization: making predictions 


Let us consider the task of making predictions with the neuron which we 
trained as a classifier in section 39.3. This was a neuron with two inputs and 
a bias. 

    y(x; w) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}}.    (41.12)

Figure 41.1. The Bayesian interpretation and generalization of traditional neural network learning. 
Evolution of the probability distribution over parameters as data arrive. 

Figure 41.2. Making predictions. (a) The function performed by an optimized neuron w_MP (shown by 
three of its contours) trained with weight decay, α = 0.01 (from figure 39.6). The contours 
shown are those corresponding to a = 0, ±1, namely y = 0.5, 0.27 and 0.73. (b) Are these 
predictions more reasonable? (Contours shown are for y = 0.5, 0.27, 0.73, 0.12 and 0.88.) 
(c) The posterior probability of w (schematic); the Bayesian predictions shown in (b) were 
obtained by averaging together the predictions made by each possible value of the weights 
w, with each value of w receiving a vote proportional to its probability under the posterior 
ensemble. The method used to create (b) is described in section 41.4. 

When we last played with it, we trained it by minimizing the objective function 

    M(w) = G(w) + \alpha E_W(w).    (41.13)

The resulting optimized function for the case α = 0.01 is reproduced in 
figure 41.2a. 

We now consider the task of predicting the class t^{(N+1)} corresponding to a 
new input x^{(N+1)}. It is common practice, when making predictions, simply to 
use a neural network with its weights fixed to their optimized value w_MP, but 
this is not optimal, as can be seen intuitively by considering the predictions 
shown in figure 41.2a. Are these reasonable predictions? Consider new data 
arriving at points A and B. The best-fit model assigns both of these examples 
probability 0.2 of being in class 1, because they have the same value of w_MP·x. 
If we really knew that w was equal to w_MP, then these predictions would be 
correct. But we do not know w. The parameters are uncertain. Intuitively we 
might be inclined to assign a less confident probability (closer to 0.5) at B than 
at A, as shown in figure 41.2b, since point B is far from the training data. The 
best-fit parameters w_MP often give over-confident predictions. A non-Bayesian 
approach to this problem is to downweight all predictions uniformly, by an 
empirically determined factor (Copas, 1983). This is not ideal, since intuition 
suggests the strength of the predictions at B should be downweighted more 
than those at A. A Bayesian viewpoint helps us to understand the cause of 
the problem, and provides a straightforward solution. In a nutshell, we obtain 
Bayesian predictions by taking into account the whole posterior ensemble, 
shown schematically in figure 41.2c. 

The Bayesian prediction of a new datum t^{(N+1)} involves marginalizing over 
the parameters (and over anything else about which we are uncertain). For 
simplicity, let us assume that the weights w are the only uncertain quantities; 
the weight decay rate α and the model H itself are assumed to be fixed. 
Then by the sum rule, the predictive probability of a new target t^{(N+1)} at a 
location x^{(N+1)} is: 

    P(t^{(N+1)} | x^{(N+1)}, D, \alpha) = \int d^K w \, P(t^{(N+1)} | x^{(N+1)}, w, \alpha) \, P(w | D, \alpha),    (41.14)

where K is the dimensionality of w, three in the toy problem. Thus the 
predictions are obtained by weighting the prediction for each possible w, 

    P(t^{(N+1)} = 1 | x^{(N+1)}, w, \alpha) = y(x^{(N+1)}; w)
    P(t^{(N+1)} = 0 | x^{(N+1)}, w, \alpha) = 1 - y(x^{(N+1)}; w),    (41.15)

with a weight given by the posterior probability of w, P(w | D,α), which we 
most recently wrote down in equation (41.10). This posterior probability is 

    P(w | D, \alpha) = \frac{1}{Z_M} \exp(-M(w)),    (41.16)

where 

    Z_M = \int d^K w \, \exp(-M(w)).    (41.17)

In summary, we can get the Bayesian predictions if we can find a way of 
computing the integral 

    P(t^{(N+1)} = 1 | x^{(N+1)}, D, \alpha) = \int d^K w \, y(x^{(N+1)}; w) \, \frac{1}{Z_M} \exp(-M(w)),    (41.18)

which is the average of the output of the neuron at x^{(N+1)} under the posterior 
distribution of w. 
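
In very low dimensions the integral can be approximated by brute force, which makes its meaning concrete. The sketch below is an illustration under stated assumptions, not the implementation used for the figures: it takes a neuron with two weights and no bias, a hypothetical four-point data set, α = 1, and replaces the integral over w by a sum over a square grid whose extent and spacing are chosen arbitrarily.

```python
import math

def softplus(a):
    """ln(1 + e^a), computed without overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def sigmoid(a):
    """1 / (1 + e^-a), written via softplus for numerical safety."""
    return math.exp(-softplus(-a))

def M(w, X, t, alpha):
    """Objective M(w) = G(w) + alpha * E_W(w) for a neuron without bias."""
    G = 0.0
    for x, tn in zip(X, t):
        a = w[0] * x[0] + w[1] * x[1]
        G += softplus(a) - tn * a      # equals -[t ln y + (1 - t) ln(1 - y)]
    return G + alpha * 0.5 * (w[0] ** 2 + w[1] ** 2)

def bayes_prediction(x_new, X, t, alpha, half_width=8.0, steps=161):
    """Approximate the predictive integral by a sum over a grid in (w1, w2)."""
    grid = [-half_width + 2.0 * half_width * i / (steps - 1) for i in range(steps)]
    numerator = 0.0
    Z = 0.0                            # grid estimate of Z_M, up to the cell area
    for w1 in grid:
        for w2 in grid:
            weight = math.exp(-M((w1, w2), X, t, alpha))
            Z += weight
            numerator += weight * sigmoid(w1 * x_new[0] + w2 * x_new[1])
    return numerator / Z

# Hypothetical training data: two examples of each class.
X = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-1.0, -3.0)]
t = [1, 1, 0, 0]
alpha = 1.0
for x_new in [(1.0, 1.0), (6.0, 6.0), (0.0, 0.0)]:
    print(x_new, bayes_prediction(x_new, X, t, alpha))
```

The cell area cancels between numerator and denominator, so it never needs to be computed. A grid of this kind scales exponentially with the number of weights, which is why other methods are needed for realistic networks.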
Figure 41.3. One step of the Langevin method in two dimensions (c), 
contrasted with a traditional ‘dumb’ Metropolis method (a) and with gradient 
descent (b). The proposal density of the Langevin method is given by 
‘gradient descent with noise’. Panels: (a) Dumb Metropolis; (b) Gradient 
descent; (c) Langevin. 


Implementation 


How shall we compute the integral (41.18)? For our toy problem, the weight 
space is three dimensional; for a realistic neural network the dimensionality 
K might be in the thousands. 

Bayesian inference for general data modelling problems may be implemented 
by exact methods (Chapter 25), by Monte Carlo sampling (Chapter 
29), or by deterministic approximate methods, for example, methods that 
make Gaussian approximations to P(w | D,α) using Laplace’s method (Chapter 
27) or variational methods (Chapter 33). For neural networks there are few 
exact methods. The two main approaches to implementing Bayesian inference 
for neural networks are the Monte Carlo methods developed by Neal (1996) 
and the Gaussian approximation methods developed by MacKay (1991). 


> 41.4 Monte Carlo implementation of a single neuron 


First we will use a Monte Carlo approach in which the task of evaluating the 
integral (41.18) is solved by treating y(x^{(N+1)}; w) as a function f of w whose 
mean we compute using 

    \langle f(w) \rangle \simeq \frac{1}{R} \sum_{r} f(w^{(r)}),    (41.19)

where {w^{(r)}} are samples from the posterior distribution (1/Z_M) exp(−M(w)) (cf. 
equation (29.6)). We obtain the samples using a Metropolis method (section 
29.4). As an aside, a possible disadvantage of this Monte Carlo approach is 
that it is a poor way of estimating the probability of an improbable event, i.e., 
a P(t|D,H) that is very close to zero, if the improbable event is most likely 
to occur in conjunction with improbable parameter values. 
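
The sample average can be illustrated with the simplest possible sampler. The sketch below uses a plain random-walk Metropolis method with a Gaussian proposal, which is a stand-in chosen here for brevity and is not the Langevin method; the data set, α = 1, step size, burn-in length and seed are all assumptions made for the example.

```python
import math
import random

def softplus(a):
    """ln(1 + e^a), computed without overflow."""
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))

def sigmoid(a):
    return math.exp(-softplus(-a))

def M(w, X, t, alpha):
    """Objective M(w) = G(w) + alpha * E_W(w) for a neuron without bias."""
    G = 0.0
    for x, tn in zip(X, t):
        a = sum(wi * xi for wi, xi in zip(w, x))
        G += softplus(a) - tn * a
    return G + alpha * 0.5 * sum(wi * wi for wi in w)

def metropolis_prediction(x_new, X, t, alpha, n_samples=5000,
                          burn_in=500, step=0.5, seed=0):
    """Estimate the posterior average of y(x_new; w) from Metropolis samples.
    Returns (estimate, acceptance rate)."""
    rng = random.Random(seed)
    w = [0.0] * len(x_new)
    M_w = M(w, X, t, alpha)
    total, accepted = 0.0, 0
    for r in range(burn_in + n_samples):
        proposal = [wi + step * rng.gauss(0.0, 1.0) for wi in w]
        M_new = M(proposal, X, t, alpha)
        # Symmetric proposal: accept with probability min(1, exp(-(M_new - M_w))).
        if M_new <= M_w or rng.random() < math.exp(-(M_new - M_w)):
            w, M_w = proposal, M_new
            accepted += 1
        if r >= burn_in:
            total += sigmoid(sum(wi * xi for wi, xi in zip(w, x_new)))
    return total / n_samples, accepted / (burn_in + n_samples)

# Hypothetical training data: two examples of each class.
X = [(2.0, 1.0), (3.0, 2.0), (-2.0, -1.0), (-1.0, -3.0)]
t = [1, 1, 0, 0]
alpha = 1.0
print(metropolis_prediction((1.0, 1.0), X, t, alpha))
```

Successive samples from such a chain are correlated, so the effective number of independent samples is smaller than `n_samples`.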

How to generate the samples {w^{(r)}}? Radford Neal introduced the Hamiltonian 
Monte Carlo method to neural networks. We met this sophisticated 
Metropolis method, which makes use of gradient information, in Chapter 30. 
The method we now demonstrate is a simple version of Hamiltonian Monte 
Carlo called the Langevin Monte Carlo method. 


The Langevin Monte Carlo method 


The Langevin method (algorithm 41.4) may be summarized as ‘gradient descent
with added noise’, as shown pictorially in figure 41.3. A noise vector $\mathbf{p}$
is generated from a Gaussian with unit variance. The gradient $\mathbf{g}$ is computed,
and a step in $\mathbf{w}$ is made, given by

\[
\Delta \mathbf{w} = -\tfrac{1}{2} \epsilon^2 \mathbf{g} + \epsilon \mathbf{p}. \tag{41.20}
\]


    # x (inputs), t (targets) and alpha (weight decay) are assumed to be
    # visible inside gradM and findM, e.g. as globals.

    g = gradM ( w ) ;                   # set gradient using initial w
    M = findM ( w ) ;                   # set objective function too

    for l = 1:L                         # loop L times
      p = randn ( size(w) ) ;           # initial momentum is Normal(0,1)
      H = p' * p / 2 + M ;              # evaluate H(w,p)

      p = p - epsilon * g / 2 ;         # * make half-step in p
      wnew = w + epsilon * p ;          # * make step in w
      gnew = gradM ( wnew ) ;           # * find new gradient
      p = p - epsilon * gnew / 2 ;      # * make half-step in p

      Mnew = findM ( wnew ) ;           # find new objective function
      Hnew = p' * p / 2 + Mnew ;        # evaluate new value of H
      dH = Hnew - H ;                   # decide whether to accept
      if ( dH < 0 )                     accept = 1 ;
      elseif ( rand() < exp(-dH) )      accept = 1 ;   # compare with a uniform
      else                              accept = 0 ;   # variate
      endif
      if ( accept )  g = gnew ;  w = wnew ;  M = Mnew ;  endif
    endfor

    function gM = gradM ( w )           # gradient of objective function
      a = x * w ;                       # compute activations
      y = sigmoid(a) ;                  # compute outputs
      e = t - y ;                       # compute errors
      g = - x' * e ;                    # compute the gradient of G(w)
      gM = alpha * w + g ;
    endfunction

    function M = findM ( w )            # objective function
      y = sigmoid( x * w ) ;            # compute outputs
      G = - (t' * log(y) + (1-t') * log( 1-y )) ;
      EW = w' * w / 2 ;
      M = G + alpha * EW ;
    endfunction

Algorithm 41.4. Octave source code for the Langevin Monte Carlo method. To
obtain the Hamiltonian Monte Carlo method, we repeat the four lines marked *
multiple times (algorithm 41.8).


Figure 41.5. A single neuron learning under the Langevin Monte Carlo method.
(a) Evolution of weights $w_0$, $w_1$ and $w_2$ as a function of number of
iterations. (b) Evolution of weights $w_1$ and $w_2$ in weight space. Also shown
by a line is the evolution of the weights using the optimizer of figure 39.6.
(c) The error function G(w) as a function of number of iterations. Also shown
is the error function during the optimization of figure 39.6. (d) The
objective function M(w) as a function of number of iterations. See also
figures 41.6 and 41.7.


Notice that if the $\epsilon \mathbf{p}$ term were omitted this would simply be gradient descent
with learning rate $\eta = \tfrac{1}{2} \epsilon^2$. This step in w is accepted or rejected depending
on the change in the value of the objective function M(w) and on the change
in gradient, with a probability of acceptance such that detailed balance holds.

The Langevin method has one free parameter, $\epsilon$, which controls the typical
step size. If $\epsilon$ is set to too large a value, moves may be rejected. If it is set to
a very small value, progress around the state space will be slow.
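
The Langevin update can also be sketched in a few lines of Python, using only
the standard library. This is a minimal illustration rather than the code used
for the figures: the toy data set `X`, `t`, the seed, and the helper names
`softplus`, `find_M`, `grad_M` and `langevin` are all assumptions made for the
example. The structure follows the half-step in p, step in w, half-step in p
pattern described above, followed by the Metropolis accept/reject rule.

```python
import math
import random

def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def softplus(z):
    """log(1 + exp(z)), computed stably."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def find_M(w, X, t, alpha):
    """Objective M(w) = G(w) + alpha * E_W(w) for a single neuron."""
    G = 0.0
    for x, tn in zip(X, t):
        a = sum(wi * xi for wi, xi in zip(w, x))
        # -[t ln y + (1 - t) ln(1 - y)] with y = sigmoid(a)
        G += tn * softplus(-a) + (1 - tn) * softplus(a)
    return G + alpha * 0.5 * sum(wi * wi for wi in w)

def grad_M(w, X, t, alpha):
    """Gradient of M: alpha * w - sum_n (t_n - y_n) x_n."""
    g = [alpha * wi for wi in w]
    for x, tn in zip(X, t):
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            g[i] -= (tn - y) * xi
    return g

def langevin(w, X, t, alpha, eps, L, rng):
    """Run L Langevin proposals; return visited states and acceptance rate."""
    g, M = grad_M(w, X, t, alpha), find_M(w, X, t, alpha)
    samples, n_accept = [], 0
    for _ in range(L):
        p = [rng.gauss(0.0, 1.0) for _ in w]                   # momentum ~ Normal(0,1)
        H = 0.5 * sum(pi * pi for pi in p) + M
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]      # half-step in p
        w_new = [wi + eps * pi for wi, pi in zip(w, p)]        # step in w
        g_new = grad_M(w_new, X, t, alpha)                     # new gradient
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g_new)]  # half-step in p
        M_new = find_M(w_new, X, t, alpha)
        dH = 0.5 * sum(pi * pi for pi in p) + M_new - H
        if dH < 0 or rng.random() < math.exp(-dH):             # Metropolis rule
            w, g, M = w_new, g_new, M_new
            n_accept += 1
        samples.append(list(w))
    return samples, n_accept / L

# Hypothetical toy data: each input is (1, x1, x2); the leading 1 is the bias input.
X = [(1, 1.0, 2.0), (1, 2.0, 1.5), (1, 3.0, 3.0),
     (1, 5.0, 6.0), (1, 6.0, 5.0), (1, 7.0, 7.5)]
t = [0, 0, 0, 1, 1, 1]
eta = 0.01
samples, rate = langevin([0.0, 0.0, 0.0], X, t, alpha=0.01,
                         eps=math.sqrt(2 * eta), L=2000, rng=random.Random(0))
print("acceptance rate:", rate)
```

Rejected proposals repeat the current state in `samples`, as a Metropolis
method requires. Trying a larger or smaller `eps` shows the trade-off between
rejection and slow progress described above.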


Demonstration of Langevin method 


The Langevin method is demonstrated in figures 41.5, 41.6 and 41.7. Here, the
objective function is $M(\mathbf{w}) = G(\mathbf{w}) + \alpha E_W(\mathbf{w})$, with $\alpha = 0.01$. These figures
include, for comparison, the results of the previous optimization method using
gradient descent on the same objective function (figure 39.6). It can be seen
that the mean evolution of w is similar to the evolution of the parameters
under gradient descent. The Monte Carlo method appears to have converged
to the posterior distribution after about 10,000 iterations.

The average acceptance rate during this simulation was 93%; only 7% of
the proposed moves were rejected. Probably, faster progress around the state
space would have been made if a larger step size $\epsilon$ had been used, but the
value was chosen so that the ‘descent rate’ $\eta = \tfrac{1}{2} \epsilon^2$ matched the step size of
the earlier simulations.


Making Bayesian predictions 


From iteration 10,000 to 40,000, the weights were sampled every 1000 itera- 
tions and the corresponding functions of x are plotted in figure 41.6. There 
is a considerable variety of plausible functions. We obtain a Monte Carlo ap- 
proximation to the Bayesian predictions by averaging these thirty functions of 
x together. The result is shown in figure 41.7 and contrasted with the predic- 
tions given by the optimized parameters. The Bayesian predictions become 
satisfyingly moderate as we move away from the region of highest data density. 

Figure 41.6. Samples obtained by the Langevin Monte Carlo method. The learning
rate was set to $\eta = 0.01$ and the weight decay rate to $\alpha = 0.01$. The step size
is given by $\epsilon = \sqrt{2\eta}$. The function performed by the neuron is shown by
three of its contours every 1000 iterations from iteration 10,000 to 40,000.
The contours shown are those corresponding to $a = 0, \pm 1$, namely y = 0.5, 0.27
and 0.73. Also shown is a vector proportional to $(w_1, w_2)$.

Figure 41.7. Bayesian predictions found by the Langevin Monte Carlo method
compared with the predictions using the optimized parameters. (a) The
predictive function obtained by averaging the predictions for 30 samples
uniformly spaced between iterations 10,000 and 40,000, shown in figure 41.6.
The contours shown are those corresponding to $a = 0, \pm 1, \pm 2$, namely
y = 0.5, 0.27, 0.73, 0.12 and 0.88. (b) For contrast, the predictions given by
the ‘most probable’ setting of the neuron’s parameters, as given by
optimization of M(w).

    wnew = w ;
    gnew = g ;
    for tau = 1:Tau
      p = p - epsilon * gnew / 2 ;      # make half-step in p
      wnew = wnew + epsilon * p ;       # make step in w
      gnew = gradM ( wnew ) ;           # find new gradient
      p = p - epsilon * gnew / 2 ;      # make half-step in p
    endfor

The Bayesian classifier is better able to identify the points where the
classification is uncertain. This pleasing behaviour results simply from a
mechanical application of the rules of probability.
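
The averaging step itself is short. The sketch below shows it in Python for a
neuron with two inputs and a bias. The five weight vectors in `samples` and
the optimum `w_mp` are invented purely for illustration (they are not the
samples of figure 41.6); they were chosen so that all of them agree along one
input direction and disagree along another.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(w, x):
    """Output y(x; w) of a single neuron with bias w[0]."""
    return sigmoid(w[0] + w[1] * x[0] + w[2] * x[1])

def bayes_predict(samples, x):
    """Monte Carlo estimate of P(t=1 | x, D): the mean of y(x; w) over samples."""
    return sum(predict(w, x) for w in samples) / len(samples)

# Hypothetical posterior samples and optimum, invented for illustration only.
samples = [(0.0, 3.0, 1.0), (0.0, 1.0, 3.0), (0.0, 2.0, 2.0),
           (0.0, 4.0, 0.0), (0.0, 0.0, 4.0)]
w_mp = (0.0, 2.0, 2.0)

x_near = (1.0, 1.0)    # a direction in which all the samples agree
x_far = (3.0, -2.0)    # a direction in which the samples disagree

for x in (x_near, x_far):
    print(x, "optimized:", round(predict(w_mp, x), 3),
          "averaged:", round(bayes_predict(samples, x), 3))
```

Where the samples agree, the averaged prediction coincides with that of the
single optimized weight vector; where they disagree, the average is pulled
towards 0.5, which is the moderation seen in figure 41.7.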


Optimization and typicality 


A final observation concerns the behaviour of the functions G(w) and M(w)
during the Monte Carlo sampling process, compared with the values of G and
M at the optimum $\mathbf{w}_{\rm MP}$ (figure 41.5). The function G(w) fluctuates around
the value of $G(\mathbf{w}_{\rm MP})$, though not in a symmetrical way. The function M(w)
also fluctuates, but it does not fluctuate around $M(\mathbf{w}_{\rm MP})$ (obviously it
cannot, because M is minimized at $\mathbf{w}_{\rm MP}$, so M could not go any smaller);
furthermore, M only rarely drops close to $M(\mathbf{w}_{\rm MP})$. In the language of
information theory, the typical set of w has different properties from the
most probable state $\mathbf{w}_{\rm MP}$.

A general message therefore emerges — applicable to all data models, not 
just neural networks: one should be cautious about making use of optimized 
parameters, as the properties of optimized parameters may be unrepresen- 
tative of the properties of typical, plausible parameters; and the predictions 
obtained using optimized parameters alone will often be unreasonably over- 
confident. 


Reducing random walk behaviour using Hamiltonian Monte Carlo 


As a final study of Monte Carlo methods, we now compare the Langevin Monte
Carlo method with its big brother, the Hamiltonian Monte Carlo method. The
change to Hamiltonian Monte Carlo is simple to implement, as shown in
algorithm 41.8. Each single proposal makes use of multiple gradient evaluations
along a dynamical trajectory in w, p space, where p are the extra ‘momentum’
variables of the Langevin and Hamiltonian Monte Carlo methods. The number of
steps ‘Tau’ was set at random to a number between 100 and 200 for each
trajectory. The step size $\epsilon$ was kept fixed so as to retain comparability with
the simulations that have gone before; it is recommended that one randomize
the step size in practical applications, however.


Algorithm 41.8. Octave source code for the Hamiltonian Monte Carlo method. The
algorithm is identical to the Langevin method in algorithm 41.4, except for
the replacement of the four lines marked * in that algorithm by the fragment
shown here.


Figure 41.9. Comparison of sampling properties of the Langevin Monte Carlo
method and the Hamiltonian Monte Carlo (HMC) method. The horizontal axis is
the number of gradient evaluations made. Each figure shows the weights during
the first 10,000 iterations. The rejection rate during this Hamiltonian Monte
Carlo simulation was 8%.

Figure 41.9 compares the sampling properties of the Langevin and Hamil- 
tonian Monte Carlo methods. The autocorrelation of the state of the Hamil- 
tonian Monte Carlo simulation falls much more rapidly with simulation time 
than that of the Langevin method. For this toy problem, Hamiltonian Monte 
Carlo is at least ten times more efficient in its use of computer time. 
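
The change from Langevin to Hamiltonian Monte Carlo can be sketched in Python
as follows. This is an illustration under stated assumptions, not the
simulation behind figure 41.9: the target is a hypothetical correlated
two-dimensional Gaussian with $M(\mathbf{w}) = \tfrac{1}{2} \mathbf{w}^{\mathsf T} \mathbf{A} \mathbf{w}$, and the names `leapfrog`
and `hmc`, the matrix `A`, the step size and the range of trajectory lengths
are all choices made for the example.

```python
import math
import random

def leapfrog(w, p, g, grad_M, eps, Tau):
    """Tau leapfrog steps from (w, p); g is the gradient of M at w."""
    w, p, g = list(w), list(p), list(g)
    for _ in range(Tau):
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]   # half-step in p
        w = [wi + eps * pi for wi, pi in zip(w, p)]         # step in w
        g = grad_M(w)                                       # new gradient
        p = [pi - 0.5 * eps * gi for pi, gi in zip(p, g)]   # half-step in p
    return w, p, g

def hmc(w, find_M, grad_M, eps, tau_min, tau_max, L, rng):
    """L Hamiltonian Monte Carlo proposals with random trajectory lengths."""
    g, M = grad_M(w), find_M(w)
    samples, n_accept = [], 0
    for _ in range(L):
        p = [rng.gauss(0.0, 1.0) for _ in w]
        H = 0.5 * sum(pi * pi for pi in p) + M
        Tau = rng.randint(tau_min, tau_max)
        w_new, p_new, g_new = leapfrog(w, p, g, grad_M, eps, Tau)
        M_new = find_M(w_new)
        dH = 0.5 * sum(pi * pi for pi in p_new) + M_new - H
        if dH < 0 or rng.random() < math.exp(-dH):
            w, g, M = w_new, g_new, M_new
            n_accept += 1
        samples.append(list(w))
    return samples, n_accept / L

# Hypothetical target: M(w) = 1/2 w^T A w, a correlated Gaussian.
A = [[1.0, -0.9], [-0.9, 1.0]]

def find_M(w):
    return 0.5 * sum(A[i][j] * w[i] * w[j] for i in range(2) for j in range(2))

def grad_M(w):
    return [sum(A[i][j] * w[j] for j in range(2)) for i in range(2)]

samples, rate = hmc([1.0, -0.5], find_M, grad_M, eps=0.1,
                    tau_min=10, tau_max=20, L=500, rng=random.Random(1))
print("acceptance rate:", rate)
```

Setting `tau_min = tau_max = 1` recovers the Langevin method. The leapfrog
trajectory is reversible and nearly conserves H, which is why long
trajectories can still be accepted with high probability.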


41.5 Implementing inference with Gaussian approximations 


Physicists love to take nonlinearities and locally linearize them, and they love 
to approximate probability distributions by Gaussians. Such approximations 
offer an alternative strategy for dealing with the integral 


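\[
P(t^{(N+1)} = 1 \mid \mathbf{x}^{(N+1)}, D, \alpha) = \int \mathrm{d}^K \mathbf{w} \; y(\mathbf{x}^{(N+1)}; \mathbf{w}) \, \frac{1}{Z_M} \exp(-M(\mathbf{w})), \tag{41.21}
\]

which we just evaluated using Monte Carlo methods.

We start by making a Gaussian approximation to the posterior probability.
We go to the minimum of M(w) (using a gradient-based optimizer) and
Taylor-expand M there:

\[
M(\mathbf{w}) \simeq M(\mathbf{w}_{\rm MP}) + \tfrac{1}{2} (\mathbf{w} - \mathbf{w}_{\rm MP})^{\mathsf T} \mathbf{A} (\mathbf{w} - \mathbf{w}_{\rm MP}) + \cdots, \tag{41.22}
\]

where A is the matrix of second derivatives, also known as the Hessian, defined
by

\[
A_{ij} \equiv \left. \frac{\partial^2}{\partial w_i \partial w_j} M(\mathbf{w}) \right|_{\mathbf{w} = \mathbf{w}_{\rm MP}}. \tag{41.23}
\]

We thus define our Gaussian approximation:

\[
Q(\mathbf{w}; \mathbf{w}_{\rm MP}, \mathbf{A}) = [\det(\mathbf{A}/2\pi)]^{1/2} \exp\left[ -\tfrac{1}{2} (\mathbf{w} - \mathbf{w}_{\rm MP})^{\mathsf T} \mathbf{A} (\mathbf{w} - \mathbf{w}_{\rm MP}) \right]. \tag{41.24}
\]

We can think of the matrix A as defining error bars on w. To be precise, Q
is a normal distribution whose variance-covariance matrix is $\mathbf{A}^{-1}$.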


Exercise 41.1.[2] Show that the second derivative of M(w) with respect to w
is given by

\[
\frac{\partial^2}{\partial w_i \partial w_j} M(\mathbf{w}) = \sum_{n=1}^{N} f'(a^{(n)}) \, x_i^{(n)} x_j^{(n)} + \alpha \delta_{ij}, \tag{41.25}
\]

where $f'(a)$ is the first derivative of $f(a) \equiv 1/(1 + e^{-a})$, which is

\[
f'(a) = \frac{\mathrm{d}}{\mathrm{d}a} f(a) = f(a)(1 - f(a)), \tag{41.26}
\]

and

\[
a^{(n)} = \sum_j w_j x_j^{(n)}. \tag{41.27}
\]


Having computed the Hessian, our task is then to perform the integral (41.21)
using our Gaussian approximation.
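
As an informal check on the formula for the Hessian, the following Python
sketch evaluates equation (41.25) and compares it with finite differences of
the gradient of M(w). The toy data `X`, `t`, the test point `w0` and the
function names are assumptions made for the example; in the method itself the
Hessian would be evaluated at the optimum found by the optimizer.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_M(w, X, t, alpha):
    """Gradient of M(w) = G(w) + alpha * E_W(w)."""
    g = [alpha * wi for wi in w]
    for x, tn in zip(X, t):
        y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for i, xi in enumerate(x):
            g[i] -= (tn - y) * xi
    return g

def hessian_M(w, X, alpha):
    """A_ij = sum_n f'(a_n) x_i x_j + alpha * delta_ij, as in (41.25)."""
    K = len(w)
    A = [[alpha if i == j else 0.0 for j in range(K)] for i in range(K)]
    for x in X:
        f = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        fp = f * (1.0 - f)                      # f'(a) = f(a)(1 - f(a))
        for i in range(K):
            for j in range(K):
                A[i][j] += fp * x[i] * x[j]
    return A

# Hypothetical toy data: inputs (1, x1, x2) with a bias input, binary targets.
X = [(1, 1.0, 2.0), (1, 2.0, 1.5), (1, 5.0, 6.0), (1, 6.0, 5.0)]
t = [0, 0, 1, 1]
alpha = 0.01
w0 = [0.2, -0.1, 0.3]

A = hessian_M(w0, X, alpha)

# Central finite differences of the gradient, column by column.
h = 1e-5
max_err = 0.0
for j in range(3):
    wp, wm = list(w0), list(w0)
    wp[j] += h
    wm[j] -= h
    gp, gm = grad_M(wp, X, t, alpha), grad_M(wm, X, t, alpha)
    for i in range(3):
        max_err = max(max_err, abs((gp[i] - gm[i]) / (2 * h) - A[i][j]))
print("largest discrepancy:", max_err)
```

Note that the targets t do not appear in `hessian_M`: for this model the
curvature of M depends on the data only through the inputs.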

Figure 41.10. The marginalized probability, and an approximation to it.
(a) The function $\psi(a, s^2)$, evaluated numerically. In (b) the functions
$\psi(a, s^2)$ and $\phi(a, s^2)$ defined in the text are shown as a function of a for
$s^2 = 4$. From MacKay (1992b).


Figure 41.11. The Gaussian approximation in weight space and its approximate
predictions in input space. (a) A projection of the Gaussian approximation
onto the $(w_1, w_2)$ plane of weight space. The one- and two-standard-deviation
contours are shown. Also shown are the trajectory of the optimizer, and the
Monte Carlo method’s samples. (b) The predictive function obtained from the
Gaussian approximation and equation (41.30). (cf. figure 41.2.)


Calculating the marginalized probability


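The output y(x; w) depends on w only through the scalar a(x; w), so we can
reduce the dimensionality of the integral by finding the probability density of
a. We are assuming a locally Gaussian posterior probability distribution over
$\mathbf{w} = \mathbf{w}_{\rm MP} + \Delta\mathbf{w}$, $P(\mathbf{w} \mid D, \alpha) \simeq (1/Z_Q) \exp(-\tfrac{1}{2} \Delta\mathbf{w}^{\mathsf T} \mathbf{A} \Delta\mathbf{w})$. For our single
neuron, the activation a(x; w) is a linear function of w with $\partial a / \partial \mathbf{w} = \mathbf{x}$, so
for any x, the activation a is Gaussian-distributed.


> Exercise 41.2.[2] Assuming w is Gaussian-distributed with mean $\mathbf{w}_{\rm MP}$ and
variance-covariance matrix $\mathbf{A}^{-1}$, show that the probability distribution
of a(x) is

\[
P(a \mid \mathbf{x}, D, \alpha) = \mathrm{Normal}(a_{\rm MP}, s^2) = \frac{1}{\sqrt{2\pi s^2}} \exp\left( -\frac{(a - a_{\rm MP})^2}{2 s^2} \right), \tag{41.28}
\]

where $a_{\rm MP} = a(\mathbf{x}; \mathbf{w}_{\rm MP})$ and $s^2 = \mathbf{x}^{\mathsf T} \mathbf{A}^{-1} \mathbf{x}$.


This means that the marginalized output is:

\[
P(t = 1 \mid \mathbf{x}, D, \alpha) = \psi(a_{\rm MP}, s^2) \equiv \int \mathrm{d}a \; f(a) \, \mathrm{Normal}(a_{\rm MP}, s^2). \tag{41.29}
\]

This is to be contrasted with $y(\mathbf{x}; \mathbf{w}_{\rm MP}) = f(a_{\rm MP})$, the output of the most
probable network. The integral of a sigmoid times a Gaussian can be approximated
by:

\[
\psi(a_{\rm MP}, s^2) \simeq \phi(a_{\rm MP}, s^2) \equiv f(\kappa(s) \, a_{\rm MP}) \tag{41.30}
\]

with $\kappa = 1/\sqrt{1 + \pi s^2 / 8}$ (figure 41.10).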
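
The quality of the approximation in equation (41.30) is easy to probe
numerically. The Python sketch below evaluates the integral (41.29) by a
simple midpoint rule and compares it with the closed form; the function names
`psi` and `phi`, the grid size and the integration width are assumptions made
for the example.

```python
import math

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def psi(a_mp, s2, n=4000, width=10.0):
    """Midpoint-rule estimate of the integral of f(a) Normal(a; a_mp, s2)."""
    s = math.sqrt(s2)
    lo = a_mp - width * s
    h = 2.0 * width * s / n
    total = 0.0
    for k in range(n):
        a = lo + (k + 0.5) * h
        total += sigmoid(a) * math.exp(-(a - a_mp) ** 2 / (2.0 * s2))
    return total * h / math.sqrt(2.0 * math.pi * s2)

def phi(a_mp, s2):
    """Closed-form approximation f(kappa * a_mp) of equation (41.30)."""
    kappa = 1.0 / math.sqrt(1.0 + math.pi * s2 / 8.0)
    return sigmoid(kappa * a_mp)

s2 = 4.0
for a_mp in (-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0):
    print(a_mp, round(sigmoid(a_mp), 3), round(psi(a_mp, s2), 3),
          round(phi(a_mp, s2), 3))
```

The three printed columns are the output of the most probable network, the
numerically marginalized output, and the approximation. Both marginalized
quantities are closer to 0.5 than $f(a_{\rm MP})$, and all three agree at
$a_{\rm MP} = 0$.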


Demonstration 


Figure 41.11 shows the result of fitting a Gaussian approximation at the
optimum $\mathbf{w}_{\rm MP}$, and the results of using that Gaussian approximation and
equation (41.30) to make predictions. Comparing these predictions with those of
the Langevin Monte Carlo method (figure 41.7) we observe that, whilst
qualitatively the same, the two are clearly numerically different. So at least
one of the two methods is not completely accurate.


> Exercise 41.3.[2] Is the Gaussian approximation to $P(\mathbf{w} \mid D, \alpha)$ too heavy-tailed
or too light-tailed, or both? It may help to consider $P(\mathbf{w} \mid D, \alpha)$ as a
function of one parameter $w_i$ and to think of the two distributions on
a logarithmic scale. Discuss the conditions under which the Gaussian
approximation is most accurate.


Why marginalize? 


If the output is immediately used to make a (0/1) decision and the costs asso- 
ciated with error are symmetrical, then the use of marginalized outputs under 
this Gaussian approximation will make no difference to the performance of the 
classifier, compared with using the outputs given by the most probable
parameters, since both functions pass through 0.5 at $a_{\rm MP} = 0$. But these Bayesian
outputs will make a difference if, for example, there is an option of saying ‘I 
don’t know’, in addition to saying ‘I guess 0’ and ‘I guess 1’. And even if 
there are just the two choices ‘0’ and ‘1’, if the costs associated with error are 
unequal, then the decision boundary will be some contour other than the 0.5 
contour, and the boundary will be affected by marginalization. 
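
A small calculation makes the last point concrete. The sketch below assumes
the approximation of equation (41.30) and a pair of hypothetical costs,
`cost_fp` for guessing ‘1’ when the truth is 0 and `cost_fn` for guessing ‘0’
when the truth is 1; the function names are also inventions for the example.
Minimizing expected cost means guessing ‘1’ when P(t=1) exceeds
cost_fp/(cost_fp + cost_fn), and the code converts that probability threshold
into a threshold on $a_{\rm MP}$ for both kinds of output.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def kappa(s2):
    return 1.0 / math.sqrt(1.0 + math.pi * s2 / 8.0)

def threshold(cost_fp, cost_fn):
    """Guess '1' when P(t=1) exceeds this value (minimum expected cost)."""
    return cost_fp / (cost_fp + cost_fn)

def boundary_mp(p_star):
    """Value of a_MP at which f(a_MP) = p_star."""
    return logit(p_star)

def boundary_marginalized(p_star, s2):
    """Value of a_MP at which f(kappa(s2) * a_MP) = p_star."""
    return logit(p_star) / kappa(s2)

for costs in ((1.0, 1.0), (9.0, 1.0)):
    p_star = threshold(*costs)
    print(costs, p_star, boundary_mp(p_star), boundary_marginalized(p_star, 4.0))
```

With equal costs both boundaries sit at $a_{\rm MP} = 0$. With unequal costs the
marginalized boundary lies further from zero, and because $s^2$ depends on x the
boundary in input space changes shape as well as position.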
Postscript on Supervised Neural Networks


One of my students, Robert, asked: 


Maybe I’m missing something fundamental, but supervised neural 
networks seem equivalent to fitting a pre-defined function to some 
given data, then extrapolating — what’s the difference? 


I agree with Robert. The supervised neural networks we have studied so far 
are simply parameterized nonlinear functions which can be fitted to data. 
Hopefully you will agree with another comment that Robert made: 


Unsupervised networks seem much more interesting than their supervised
counterparts. I’m amazed that it works!
Hopfield Networks


We have now spent three chapters studying the single neuron. The time has 
come to connect multiple neurons together, making the output of one neuron 
be the input to another, so as to make neural networks. 

Neural networks can be divided into two classes on the basis of their con- 
nectivity. 


Figure 42.1. (a) A feedforward 
network. (b) A feedback network. 





Feedforward networks. In a feedforward network, all the connections are 
directed such that the network forms a directed acyclic graph. 


Feedback networks. Any network that is not a feedforward network will be 
called a feedback network. 


In this chapter we will discuss a fully connected feedback network called
the Hopfield network. The weights in the Hopfield network are constrained to
be symmetric, i.e., the weight from neuron i to neuron j is equal to the weight
from neuron j to neuron i.

Hopfield networks have two applications. First, they can act as associative 
memories. Second, they can be used to solve optimization problems. We will 
first discuss the idea of associative memory, also known as content-addressable 
memory. 


> 42.1 Hebbian learning


In Chapter 38, we discussed the contrast between traditional digital memories 
and biological memories. Perhaps the most striking difference is the associative 
nature of biological memory. 

A simple model due to Donald Hebb (1949) captures the idea of associa- 
tive memory. Imagine that the weights between neurons whose activities are 
positively correlated are increased: 


\[
\frac{\mathrm{d} w_{ij}}{\mathrm{d}t} \sim \mathrm{Correlation}(x_i, x_j). \tag{42.1}
\]


Now imagine that when stimulus m is present (for example, the smell of a
banana), the activity of neuron m increases; and that neuron n is associated
with another stimulus, n (for example, the sight of a yellow object). If these
two stimuli (a yellow sight and a banana smell) co-occur in the environment,
then the Hebbian learning rule (42.1) will increase the weights $w_{nm}$ and $w_{mn}$.
This means that when, on a later occasion, stimulus n occurs in isolation,
making the activity $x_n$ large, the positive weight from n to m will cause neuron m
also to be activated. Thus the response to the sight of a yellow object is an 
automatic association with the smell of a banana. We could call this ‘pattern 
completion’. No teacher is required for this associative memory to work. No 
signal is needed to indicate that a correlation has been detected or that an as- 
sociation should be made. The unsupervised, local learning algorithm and the 
unsupervised, local activity rule spontaneously produce associative memory. 

This idea seems so simple and so effective that it must be relevant to how 
memories work in the brain. 


> 42.2 Definition of the binary Hopfield network


Convention for weights. Our convention in general will be that $w_{ij}$ denotes
the connection from neuron j to neuron i.


Architecture. A Hopfield network consists of I neurons. They are fully
connected through symmetric, bidirectional connections with weights
$w_{ij} = w_{ji}$. There are no self-connections, so $w_{ii} = 0$ for all i. Biases
$w_{i0}$ may be included (these may be viewed as weights from a neuron ‘0’
whose activity is permanently $x_0 = 1$). We will denote the activity of
neuron i (its output) by $x_i$.


Activity rule. Roughly, a Hopfield network’s activity rule is for each
neuron to update its state as if it were a single neuron with the threshold
activation function

\[
x(a) = \Theta(a) \equiv \begin{cases} 1 & a \geq 0 \\ -1 & a < 0. \end{cases} \tag{42.2}
\]

Since there is feedback in a Hopfield network (every neuron’s output is
an input to all the other neurons) we will have to specify an order for the
updates to occur. The updates may be synchronous or asynchronous.


Synchronous updates: all neurons compute their activations

\[
a_i = \sum_j w_{ij} x_j \tag{42.3}
\]

then update their states simultaneously to

\[
x_i = \Theta(a_i). \tag{42.4}
\]


Asynchronous updates: one neuron at a time computes its activation
and updates its state. The sequence of selected neurons may
be a fixed sequence or a random sequence.


The properties of a Hopfield network may be sensitive to the above
choices.


Learning rule. The learning rule is intended to make a set of desired
memories $\{\mathbf{x}^{(n)}\}$ be stable states of the Hopfield network’s activity rule. Each
memory is a binary pattern, with $x_i \in \{-1, 1\}$.

Figure 42.2. Associative memory (schematic). (a) A list of desired memories.
(b) The first purpose of an associative memory is pattern completion, given a
partial pattern. (c) The second purpose of a memory is error correction.

  (a)  moscow------russia
       lima----------peru
       london-----england
       tokyo--------japan
       edinburgh-scotland
       ottawa------canada
       oslo--------norway
       stockholm---sweden
       paris-------france

  (b)  moscow---:::::::::  =>  moscow------russia
       ::::::::::--canada  =>  ottawa------canada

  (c)  otowa-------canada  =>  ottawa------canada
       egindurrh-sxotland  =>  edinburgh-scotland


The weights are set using the sum of outer products or Hebb rule,

\[
w_{ij} = \eta \sum_n x_i^{(n)} x_j^{(n)}, \tag{42.5}
\]

where $\eta$ is an unimportant constant. To prevent the largest possible
weight from growing with N we might choose to set $\eta = 1/N$.
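
The definitions above fit in a few lines of Python. The sketch below is a
minimal illustration, not a reference implementation: the two eight-bit
memories `p1` and `p2` are hypothetical patterns (chosen to be orthogonal so
that the outcome is easy to verify by hand), biases are omitted, and the
function names are inventions for the example. Neurons are updated
asynchronously in a fixed order.

```python
def hebb_weights(patterns, eta=1.0):
    """Sum-of-outer-products (Hebb) rule with no self-connections."""
    I = len(patterns[0])
    W = [[0.0] * I for _ in range(I)]
    for x in patterns:
        for i in range(I):
            for j in range(I):
                if i != j:
                    W[i][j] += eta * x[i] * x[j]
    return W

def theta(a):
    """Threshold activation function: +1 if a >= 0, otherwise -1."""
    return 1 if a >= 0 else -1

def async_sweep(W, x):
    """Update every neuron once, one at a time, in index order."""
    x = list(x)
    for i in range(len(x)):
        a = sum(W[i][j] * x[j] for j in range(len(x)))
        x[i] = theta(a)
    return x

def recall(W, x, max_sweeps=20):
    """Repeat asynchronous sweeps until the state stops changing."""
    x = list(x)
    for _ in range(max_sweeps):
        x_new = async_sweep(W, x)
        if x_new == x:
            return x_new
        x = x_new
    return x

# Two hypothetical memories (orthogonal, so each is exactly stable).
p1 = [1, 1, 1, 1, -1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1, 1, -1]
W = hebb_weights([p1, p2], eta=1.0 / 2)

probe = list(p1)
probe[3] = -probe[3]          # corrupt one bit of the first memory
print(recall(W, probe) == p1)
```

For these patterns each memory, and its inverse, is a stable state, and a
single flipped bit is corrected in one sweep.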


Exercise 42.1.[1] Explain why the value of $\eta$ is not important for the Hopfield
network defined above.


> 42.3 Definition of the continuous Hopfield network


Using the identical architecture and learning rule we can define a Hopfield
network whose activities are real numbers between −1 and 1.


Activity rule. A Hopfield network’s activity rule is for each neuron to
update its state as if it were a single neuron with a sigmoid activation
function. The updates may be synchronous or asynchronous, and involve the
equations

\[
a_i = \sum_j w_{ij} x_j \tag{42.6}
\]

and

\[
x_i = \tanh(a_i). \tag{42.7}
\]


The learning rule is the same as in the binary Hopfield network, but the
value of $\eta$ becomes relevant. Alternatively, we may fix $\eta$ and introduce a gain
$\beta \in (0, \infty)$ into the activation function:

\[
x_i = \tanh(\beta a_i). \tag{42.8}
\]


Exercise 42.2.[1] Where have we encountered equations 42.6, 42.7, and 42.8
before?


> 42.4 Convergence of the Hopfield network 


The hope is that the Hopfield networks we have defined will perform associa- 
tive memory recall, as shown schematically in figure 42.2. We hope that the 
activity rule of a Hopfield network will take a partial memory or a corrupted 
memory, and perform pattern completion or error correction to restore the 
original memory. 

But why should we expect any pattern to be stable under the activity rule, 
let alone the desired memories? 

We address the continuous Hopfield network, since the binary network is 
a special case of it. We have already encountered the activity rule (42.6, 42.8) 
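when we discussed variational methods (section 33.2): when we approximated
the spin system whose energy function was

\[
E(\mathbf{x}; \mathbf{J}) = -\frac{1}{2} \sum_{m,n} J_{mn} x_m x_n - \sum_n h_n x_n \tag{42.9}
\]

with a separable distribution

\[
Q(\mathbf{x}; \mathbf{a}) = \frac{1}{Z_Q} \exp\left( \sum_n a_n x_n \right) \tag{42.10}
\]

and optimized the latter so as to minimize the variational free energy

\[
\beta \tilde{F}(\mathbf{a}) = \beta \sum_{\mathbf{x}} Q(\mathbf{x}; \mathbf{a}) E(\mathbf{x}; \mathbf{J}) - \sum_{\mathbf{x}} Q(\mathbf{x}; \mathbf{a}) \ln \frac{1}{Q(\mathbf{x}; \mathbf{a})}, \tag{42.11}
\]

we found that the pair of iterative equations

\[
a_m = \beta \left( \sum_n J_{mn} \bar{x}_n + h_m \right) \tag{42.12}
\]

and

\[
\bar{x}_n = \tanh(a_n) \tag{42.13}
\]

were guaranteed to decrease the variational free energy

\[
\beta \tilde{F}(\mathbf{a}) = \beta \left( -\frac{1}{2} \sum_{m,n} J_{mn} \bar{x}_m \bar{x}_n - \sum_n h_n \bar{x}_n \right) - \sum_n H_2^{(e)}(q_n). \tag{42.14}
\]

If we simply replace J by w, $\bar{x}$ by x, and $h_n$ by $w_{i0}$, we see that the
equations of the Hopfield network are identical to a set of mean-field equations
that minimize

\[
\beta \tilde{F}(\mathbf{x}) = -\beta \frac{1}{2} \mathbf{x}^{\mathsf T} \mathbf{W} \mathbf{x} - \sum_i H_2^{(e)}\left[ (1 + x_i)/2 \right]. \tag{42.15}
\]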


There is a general name for a function that decreases under the dynamical 
evolution of a system and that is bounded below: such a function is a Lyapunov 
function for the system. It is useful to be able to prove the existence of 
Lyapunov functions: if a system has a Lyapunov function then its dynamics 
are bound to settle down to a fixed point, which is a local minimum of the 
Lyapunov function, or a limit cycle, along which the Lyapunov function is a 
constant. Chaotic behaviour is not possible for a system with a Lyapunov 
function. If a system has a Lyapunov function then its state space can be 
divided into basins of attraction, one basin associated with each attractor. 

So, the continuous Hopfield network’s activity rules (if implemented
asynchronously) have a Lyapunov function. This Lyapunov function is a convex
function of each parameter $a_i$ so a Hopfield network’s dynamics will always
converge to a stable fixed point.

This convergence proof depends crucially on the fact that the Hopfield 
network’s connections are symmetric. It also depends on the updates being 
made asynchronously. 
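
The Lyapunov property can be observed directly. The Python sketch below tracks
the free energy of equation (42.15) after every single-neuron update of a
small continuous Hopfield network. The weights (a Hebb rule applied to two
hypothetical eight-bit patterns), the random starting state, the seed and the
function names are assumptions made for the example; biases are omitted.

```python
import math
import random

def H2e(q):
    """Binary entropy in nats, with 0 ln 0 taken as 0."""
    return -sum(v * math.log(v) for v in (q, 1.0 - q) if v > 0.0)

def free_energy(W, x, beta=1.0):
    """beta * F(x) = -beta/2 x^T W x - sum_i H2e((1 + x_i)/2)."""
    I = len(x)
    quad = sum(W[i][j] * x[i] * x[j] for i in range(I) for j in range(I))
    return -0.5 * beta * quad - sum(H2e(0.5 * (1.0 + xi)) for xi in x)

def async_updates(W, x, beta, sweeps):
    """Asynchronous updates x_i = tanh(beta * a_i); record F after each one."""
    x = list(x)
    trace = [free_energy(W, x, beta)]
    for _ in range(sweeps):
        for i in range(len(x)):
            a = sum(W[i][j] * x[j] for j in range(len(x)))
            x[i] = math.tanh(beta * a)
            trace.append(free_energy(W, x, beta))
    return x, trace

# Symmetric weights with zero diagonal, from two hypothetical patterns.
patterns = [[1, 1, 1, 1, -1, -1, -1, -1], [1, -1, 1, -1, 1, -1, 1, -1]]
I = 8
W = [[0.0 if i == j else 0.5 * sum(p[i] * p[j] for p in patterns)
      for j in range(I)] for i in range(I)]

rng = random.Random(1)
x0 = [rng.uniform(-0.5, 0.5) for _ in range(I)]
x_final, trace = async_updates(W, x0, beta=1.0, sweeps=10)
print("free energy: start", trace[0], "end", trace[-1])
```

Each update moves one activity to the minimum of the free energy along that
coordinate, so the recorded values never increase. Replacing the loop by a
synchronous update of all neurons removes this guarantee.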


Exercise 42.3.[2, p.520] Show by constructing an example that if a feedback
network does not have symmetric connections then its dynamics may
fail to converge to a fixed point.


Exercise 42.4.[2, p.521] Show by constructing an example that if a Hopfield
network is updated synchronously then, from some initial conditions, it
may fail to converge to a fixed point.


Figure 42.3. Binary Hopfield network storing four memories. (a) The four
memories, and the weight matrix. (b-h) Initial states that differ by one, two,
three, four, or even five bits from a desired memory are restored to that
memory in one or two iterations. (i-m) Some initial conditions that are far
from the memories lead to stable states other than the four memories; in (i),
the stable state looks like a mixture of two memories, ‘D’ and ‘J’; stable
state (j) is like a mixture of ‘J’ and ‘C’; in (k), we find a corrupted version
of the ‘M’ memory (two bits distant); in (l) a corrupted version of ‘J’ (four
bits distant) and in (m), a state which looks spurious until we recognize that
it is the inverse of the stable state (l).

> 42.5 The associative memory in action


Figure 42.3 shows the dynamics of a 25-unit binary Hopfield network that 
has learnt four patterns by Hebbian learning. The four patterns are displayed 
as five by five binary images in figure 42.3a. For twelve initial conditions, 
panels (b-m) show the state of the network, iteration by iteration, all 25 
units being updated asynchronously in each iteration. For an initial condition 
randomly perturbed from a memory, it often only takes one iteration for all 
the errors to be corrected. The network has more stable states in addition 
to the four desired memories: the inverse of any stable state is also a stable 
state; and there are several stable states that can be interpreted as mixtures 
of the memories. 


Brain damage 


The network can be severely damaged and still work fine as an associative 
memory. If we take the 300 weights of the network shown in figure 42.3 and 
randomly set 50 or 100 of them to zero, we still find that the desired memories 
are attracting stable states. Imagine a digital computer that still works fine 
even when 20% of its components are destroyed! 


> Exercise 42.5.[3C] Implement a Hopfield network and confirm this amazing
robust error-correcting capability.


More memories 


We can squash more memories into the network too. Figure 42.4a shows a set 
of five memories. When we train the network with Hebbian learning, all five 
memories are stable states, even when 26 of the weights are randomly deleted 
(as shown by the ‘x’s in the weight matrix). However, the basins of attraction 
are smaller than before: figures 42.4(b-f) show the dynamics resulting from 
randomly chosen starting states close to each of the memories (3 bits flipped). 
Only three of the memories are recovered correctly. 

If we try to store too many patterns, the associative memory fails catas- 
trophically. When we add a sixth pattern, as shown in figure 42.5, only one 
of the patterns is stable; the others all flow into one of two spurious stable 
states. 


> 42.6 The continuous-time continuous Hopfield network 


The fact that the Hopfield network’s properties are not robust to the minor 
change from asynchronous to synchronous updates might be a cause for con- 
cern; can this model be a useful model of biological networks? It turns out 
that once we move to a continuous-time version of the Hopfield networks, this 
issue melts away. 

We assume that each neuron’s activity $x_i$ is a continuous function of time
$x_i(t)$ and that the activations $a_i(t)$ are computed instantaneously in accordance
with

\[
a_i(t) = \sum_j w_{ij} x_j(t). \tag{42.16}
\]

The neuron’s response to its activation is assumed to be mediated by the
differential equation:

\[
\frac{\mathrm{d}}{\mathrm{d}t} x_i(t) = -\frac{1}{\tau} \left( x_i(t) - f(a_i) \right), \tag{42.17}
\]

where f(a) is the activation function, for example f(a) = tanh(a). For a
steady activation $a_i$, the activity $x_i(t)$ relaxes exponentially to $f(a_i)$ with
time-constant $\tau$.


Figure 42.4. Hopfield network storing five memories, and suffering deletion of
26 of its 300 weights. (a) The five memories, and the weights of the network,
with deleted weights shown by ‘x’. (b-f) Initial states that differ by three
random bits from a memory: some are restored, but some converge to other
states.


Figure 42.5. An overloaded Hopfield network trained on six memories, most of
which are not stable.


Figure 42.6. Failure modes of a [...] errors; (2) desired memories that are
completely lost (there is no attracting stable state at the desired memory or
near it); (3) spurious stable states unrelated to the original list;
(4) spurious stable states that are confabulations of desired memories.

  Desired memories        Stable states
  [...]
  oslo--------norway      oslo--------norway
  stockholm---sweden      stockholm---sweden
  paris-------france      paris-------france
                          wzkmhewn--xqwqwpoq   (3)
                          paris-------sweden   (4)
                          ecnarf-------sirap   (4)


Now, here is the nice result: as long as the weight matrix is symmetric, 
this system has the variational free energy (42.15) as its Lyapunov function. 


> Exercise 42.6.[1] By computing $\frac{\mathrm{d}}{\mathrm{d}t} \tilde{F}$, prove that the variational free energy
$\tilde{F}(\mathbf{x})$ is a Lyapunov function for the continuous-time Hopfield network.


It is particularly easy to prove that a function L is a Lyapunov function if the
system’s dynamics perform steepest descent on L, with
$\frac{\mathrm{d}}{\mathrm{d}t} x_i(t) \propto -\frac{\partial L}{\partial x_i}$. In
the case of the continuous-time continuous Hopfield network, it is not quite
so simple, but every component of $\frac{\mathrm{d}}{\mathrm{d}t} x_i(t)$ does have the same sign as
$-\frac{\partial \tilde{F}}{\partial x_i}$, which means that with an appropriately defined metric, the
Hopfield network dynamics do perform steepest descents on $\tilde{F}(\mathbf{x})$.
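
A crude numerical integration illustrates the behaviour. The Python sketch
below applies Euler steps to equation (42.17) for a hypothetical two-neuron
network with symmetric weights, with f(a) = tanh(a) and β = 1. The weight
matrix, starting state, step size `dt` and function names are assumptions
made for the example, and Euler integration only approximates the
continuous-time dynamics.

```python
import math

def H2e(q):
    """Binary entropy in nats, with 0 ln 0 taken as 0."""
    return -sum(v * math.log(v) for v in (q, 1.0 - q) if v > 0.0)

def free_energy(W, x):
    """F(x) = -1/2 x^T W x - sum_i H2e((1 + x_i)/2), i.e. (42.15) with beta = 1."""
    I = len(x)
    quad = sum(W[i][j] * x[i] * x[j] for i in range(I) for j in range(I))
    return -0.5 * quad - sum(H2e(0.5 * (1.0 + xi)) for xi in x)

def activations(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]

def simulate(W, x0, tau=1.0, dt=0.01, steps=3000):
    """Euler integration of dx_i/dt = -(x_i - tanh(a_i)) / tau."""
    x = list(x0)
    for _ in range(steps):
        a = activations(W, x)   # all activations computed from the current state
        x = [xi - dt * (xi - math.tanh(ai)) / tau for xi, ai in zip(x, a)]
    return x

W = [[0.0, 2.0], [2.0, 0.0]]    # hypothetical symmetric weights
x0 = [0.5, -0.2]
x_final = simulate(W, x0)

# Distance of the final state from a fixed point x = tanh(W x).
residual = max(abs(xi - math.tanh(ai))
               for xi, ai in zip(x_final, activations(W, x_final)))
print(x_final, residual, free_energy(W, x0), free_energy(W, x_final))
```

Here all neurons respond simultaneously, yet the state still settles to a
fixed point with lower free energy, in contrast to the discrete-time
synchronous updates.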


> 42.7 The capacity of the Hopfield network


One way in which we viewed learning in the single neuron was as communica- 
tion — communication of the labels of the training data set from one point in 
time to a later point in time. We found that the capacity of a linear threshold 
neuron was 2 bits per weight. 

Similarly, we might view the Hopfield associative memory as a commu- 
nication channel (figure 42.6). A list of desired memories is encoded into a 
set of weights W using the Hebb rule of equation (42.5), or perhaps some 
other learning rule. The receiver, receiving the weights W only, finds the 
stable states of the Hopfield network, which he interprets as the original mem- 
ories. This communication system can fail in various ways, as illustrated in 
the figure. 


1. Individual bits in some memories might be corrupted, that is, a sta- 
ble state of the Hopfield network is displaced a little from the desired 
memory. 


2. Entire memories might be absent from the list of attractors of the net- 
work; or a stable state might be present but have such a small basin of 
attraction that it is of no use for pattern completion and error correction. 


3. Spurious additional memories unrelated to the desired memories might 
be present. 


4. Spurious additional memories derived from the desired memories by op- 
erations such as mixing and inversion may also be present. 

Of these failure modes, modes 1 and 2 are clearly undesirable, mode 2
especially so. Mode 3 might not matter so much as long as each of the desired
memories has a large basin of attraction. The fourth failure mode might in 
some contexts actually be viewed as beneficial. For example, if a network is 
required to memorize examples of valid sentences such as ‘John loves Mary’ 
and ‘John gets cake’, we might be happy to find that ‘John loves cake’ was also 
a stable state of the network. We might call this behaviour ‘generalization’. 

The capacity of a Hopfield network with I neurons might be defined to be
the number of random patterns N that can be stored without failure-mode 2
having substantial probability. If we also require failure-mode 1 to have tiny 
probability then the resulting capacity is much smaller. We now study these 
alternative definitions of the capacity. 


The capacity of the Hopfield network — stringent definition 


We will first explore the information storage capabilities of a binary Hopfield
network that learns using the Hebb rule by considering the stability of just
one bit of one of the desired patterns, assuming that the state of the network
is set to that desired pattern $x^{(n)}$. We will assume that the patterns to be
stored are randomly selected binary patterns.

The activation of a particular neuron $i$ is

$$ a_i = \sum_j w_{ij} x_j^{(n)}, \tag{42.18} $$

where the weights are, for $i \neq j$,

$$ w_{ij} = x_i^{(n)} x_j^{(n)} + \sum_{m \neq n} x_i^{(m)} x_j^{(m)}. \tag{42.19} $$

Here we have split W into two terms, the first of which will contribute ‘signal’,
reinforcing the desired memory, and the second ‘noise’. Substituting for $w_{ij}$,
the activation is

$$ a_i = \sum_{j \neq i} x_i^{(n)} x_j^{(n)} x_j^{(n)} + \sum_{j \neq i} \sum_{m \neq n} x_i^{(m)} x_j^{(m)} x_j^{(n)} \tag{42.20} $$

$$ \phantom{a_i} = (I-1)\, x_i^{(n)} + \sum_{j \neq i} \sum_{m \neq n} x_i^{(m)} x_j^{(m)} x_j^{(n)}. \tag{42.21} $$

The first term is $(I-1)$ times the desired state $x_i^{(n)}$. If this were the only
term, it would keep the neuron firmly clamped in the desired state. The
second term is a sum of $(I-1)(N-1)$ random quantities $x_i^{(m)} x_j^{(m)} x_j^{(n)}$. A
moment’s reflection confirms that these quantities are independent random
binary variables with mean 0 and variance 1.

Thus, considering the statistics of $a_i$ under the ensemble of random patterns,
we conclude that $a_i$ has mean $(I-1)\, x_i^{(n)}$ and variance $(I-1)(N-1)$.
For brevity, we will now assume I and N are large enough that we can
neglect the distinction between $I$ and $I-1$, and between $N$ and $N-1$. Then
we can restate our conclusion: $a_i$ is Gaussian-distributed with mean $I x_i^{(n)}$ and
variance $IN$.

Figure 42.7. The probability density of the activation $a_i$ in the case
$x_i^{(n)} = 1$; the probability that bit i becomes flipped is the area of the tail.

What then is the probability that the selected bit is stable, if we put the
network into the state $x^{(n)}$? The probability that bit i will flip on the first
iteration of the Hopfield network’s dynamics is

$$ P(i \text{ unstable}) = \Phi\left( -\frac{I}{\sqrt{IN}} \right) = \Phi\left( -\frac{1}{\sqrt{N/I}} \right), \tag{42.22} $$

where

$$ \Phi(z) \equiv \int_{-\infty}^{z} \mathrm{d}z' \, \frac{1}{\sqrt{2\pi}}\, e^{-z'^2/2}. \tag{42.23} $$


The important quantity N/I is the ratio of the number of patterns stored to
the number of neurons. If, for example, we try to store $N \simeq 0.18\, I$ patterns
in the Hopfield network then there is a chance of 1% that a specified bit in a
specified pattern will be unstable on the first iteration.
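
The 1% figure can be checked numerically from equation (42.22). The following is a minimal Python sketch, not part of the original text; the function names `phi` and `p_unstable` are illustrative choices, and the Gaussian approximation of the activation is taken as given.

```python
import math


def phi(z):
    """Cumulative distribution function of a standard Gaussian, as in (42.23)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def p_unstable(load):
    """Probability that one bit flips on the first iteration, equation (42.22).

    `load` is the ratio N/I of stored patterns to neurons.
    """
    return phi(-1.0 / math.sqrt(load))


if __name__ == "__main__":
    for load in (0.05, 0.10, 0.138, 0.18):
        print(f"N/I = {load:5.3f}   P(bit unstable) = {p_unstable(load):.4f}")
```

Running the loop shows how quickly the single-bit error probability grows with the loading fraction N/I.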

We are now in a position to derive our first capacity result, for the case 
where no corruption of the desired memories is permitted. 


> Exercise 42.7.7] Assume that we wish all the desired patterns to be completely 
stable — we don’t want any of the bits to flip when the network is put 
into any desired pattern state — and the total probability of any error at 
all is required to be less than a small number $\epsilon$. Using the approximation
to the error function for large z,

$$ \Phi(-z) \simeq \frac{1}{\sqrt{2\pi}}\, \frac{e^{-z^2/2}}{z}, \tag{42.24} $$

show that the maximum number of patterns that can be stored, $N_{\max}$,
is

$$ N_{\max} \simeq \frac{I}{4 \ln I + 2 \ln(1/\epsilon)}. \tag{42.25} $$

If, however, we allow a small amount of corruption of memories to occur, the 
number of patterns that can be stored increases. 
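
To get a feel for how restrictive the stringent definition is, equation (42.25) can be evaluated for a concrete network size. This is a rough Python sketch under the same Gaussian approximation; the names `n_max` and `union_bound` are hypothetical, and the union bound $N I\, \Phi(-\sqrt{I/N})$ is used here as an assumed estimate of the total error probability.

```python
import math


def phi(z):
    """Cumulative distribution function of a standard Gaussian."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def n_max(num_neurons, eps):
    """Stringent capacity estimate of equation (42.25)."""
    return num_neurons / (4.0 * math.log(num_neurons) + 2.0 * math.log(1.0 / eps))


def union_bound(num_patterns, num_neurons):
    """Upper estimate of P(any bit of any pattern is unstable): N * I * Phi(-sqrt(I/N))."""
    z = math.sqrt(num_neurons / num_patterns)
    return num_patterns * num_neurons * phi(-z)


if __name__ == "__main__":
    I, eps = 1000, 0.01
    N = n_max(I, eps)
    print(f"I = {I}, eps = {eps}: N_max is about {N:.1f} patterns")
    print(f"union bound on the error probability at N_max: {union_bound(N, I):.2e}")
```

For I = 1000 the stringent capacity is a few tens of patterns, far below the 0.138 I discussed in the next subsection.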


The statistical physicists’ capacity


The analysis that led to equation (42.22) tells us that if we try to store
$N \simeq 0.18\, I$ patterns in the Hopfield network then, starting from a desired memory,
about 1% of the bits will be unstable on the first iteration. Our analysis does
not shed light on what is expected to happen on subsequent iterations. The 
flipping of these bits might make some of the other bits unstable too, causing 
an increasing number of bits to be flipped. This process might lead to an 
avalanche in which the network’s state ends up a long way from the desired 
memory. 

In fact, when N/I is large, such avalanches do happen. When N/I is small,
they tend not to: there is a stable state near to each desired memory. For the
limit of large I, Amit et al. (1985) have used methods from statistical physics
to find numerically the transition between these two behaviours. There is a
sharp discontinuity at

$$ N_{\mathrm{crit}} = 0.138\, I. \tag{42.26} $$




Figure 42.8. Overlap between a 
desired memory and the stable 
state nearest to it as a function of 
the loading fraction N/I. The 
overlap is defined to be the scaled 
inner product $\sum_i x_i x_i^{(n)} / I$, which
is 1 when recall is perfect and zero 
when the stable state has 50% of 
the bits flipped. There is an 
abrupt transition at N/I = 0.138, 
where the overlap drops from 0.97 
to zero. 

Below this critical value, there is likely to be a stable state near every desired
memory, in which a small fraction of the bits are flipped. When N/I exceeds
0.138, the system has only spurious stable states, known as spin glass states, 
none of which is correlated with any of the desired memories. Just below the 
critical value, the fraction of bits that are flipped when a desired memory has 
evolved to its associated stable state is 1.6%. Figure 42.8 shows the overlap 
between the desired memory and the nearest stable state as a function of N/I. 

Some other transitions in properties of the model occur at some additional 
values of N/I, as summarized below. 


For all N/I, stable spin glass states exist, uncorrelated with the desired 
memories. 


For N/I > 0.138, these spin glass states are the only stable states.


For N/I ∈ (0, 0.138), there are stable states close to the desired memories.


For N/I ∈ (0, 0.05), the stable states associated with the desired memories
have lower energy than the spurious spin glass states.


For N/I ∈ (0.05, 0.138), the spin glass states dominate: there are spin glass
states that have lower energy than the stable states associated with the
desired memories.


For N/I ∈ (0, 0.03), there are additional mixture states, which are
combinations of several desired memories. These stable states do not have as low
energy as the stable states associated with the desired memories.


In conclusion, the capacity of the Hopfield network with I neurons, if we
define the capacity in terms of the abrupt discontinuity discussed above, is
0.138 I random binary patterns, each of length I, each of which is received
with 1.6% of its bits flipped. In bits, this capacity is

$$ 0.138\, I^2 \times \left( 1 - H_2(0.016) \right) = 0.122\, I^2 \ \text{bits}. \tag{42.27} $$

[This expression for the capacity omits a smaller negative term of order
$N \log_2 N$ bits, associated with the arbitrary order of the memories.]

Since there are $I^2/2$ weights in the network, we can also express the capacity
as 0.24 bits per weight.
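
The arithmetic behind equation (42.27) and the figure of 0.24 bits per weight can be reproduced in a few lines. This is a small Python sketch; the variable names are illustrative, and the values 0.138 and 0.016 are taken directly from the text above.

```python
import math


def binary_entropy(p):
    """Binary entropy H_2(p) in bits."""
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


load = 0.138        # critical loading fraction N/I
flip = 0.016        # fraction of bits flipped just below the transition

# Capacity in units of I^2 bits: each of the 0.138 I patterns carries
# I * (1 - H_2(0.016)) bits once the flipped bits are accounted for.
bits_per_neuron_squared = load * (1.0 - binary_entropy(flip))

# There are I^2 / 2 weights, so divide by 1/2 to get bits per weight.
bits_per_weight = bits_per_neuron_squared / 0.5

if __name__ == "__main__":
    print(f"capacity = {bits_per_neuron_squared:.3f} I^2 bits")
    print(f"capacity = {bits_per_weight:.2f} bits per weight")
```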


»> 42.8 Improving on the capacity of the Hebb rule 


The capacities discussed in the previous section are the capacities of the Hop- 
field network whose weights are set using the Hebbian learning rule. We can 
do better than the Hebb rule by defining an objective function that measures 
how well the network stores all the memories, and minimizing it. 

For an associative memory to be useful, it must be able to correct at 
least one flipped bit. Let’s make an objective function that measures whether 
flipped bits tend to be restored correctly. Our intention is that, for every 
neuron i in the network, the weights to that neuron should satisfy this rule: 


for every pattern $x^{(n)}$, if the neurons other than i are set correctly
to $x_j = x_j^{(n)}$, then the activation of neuron i should be such that
its preferred output is $x_i = x_i^{(n)}$.

Is this rule a familiar idea? Yes, it is precisely what we wanted the single
neuron of Chapter 39 to do. Each pattern $x^{(n)}$ defines an input, target pair
for the single neuron i. And it defines an input, target pair for all the other
neurons too.

Algorithm 42.9. Octave source code for optimizing the weights of a Hopfield
network, so that it works as an associative memory. cf. algorithm 39.5. The
data matrix x has I columns and N rows. The matrix t is identical to x except
that −1s are replaced by 0s.

    # sigmoid(a) is assumed to be defined elsewhere as 1 ./ (1 + exp(-a))
    w = x' * x ;               # initialize the weights using Hebb rule

    for l = 1:L                # loop L times

      for i=1:I                #
        w(i,i) = 0 ;           # ensure self-weights are zero.
      end                      #

      a  = x * w ;             # compute all activations
      y  = sigmoid(a) ;        # compute all outputs
      e  = t - y ;             # compute all errors
      gw = x' * e ;            # compute the gradients
      gw = gw + gw' ;          # symmetrize gradients

      w  = w + eta * ( gw - alpha * w ) ;   # make step

    endfor








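So, just as we defined an objective function (39.11) for the training of a
single neuron as a classifier, we can define

$$ G(W) = -\sum_i \sum_n \left[ t_i^{(n)} \ln y_i^{(n)} + \left(1 - t_i^{(n)}\right) \ln\left(1 - y_i^{(n)}\right) \right] \tag{42.28} $$

where

$$ t_i^{(n)} = \begin{cases} 1 & x_i^{(n)} = 1 \\ 0 & x_i^{(n)} = -1 \end{cases} \tag{42.29} $$

and

$$ y_i^{(n)} = \frac{1}{1 + \exp(-a_i^{(n)})}, \quad \text{where } a_i^{(n)} = \sum_j w_{ij} x_j^{(n)}. \tag{42.30} $$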


We can then steal the algorithm (algorithm 39.5, p.478) which we wrote for 
the single neuron, to write an algorithm for optimizing a Hopfield network, 
algorithm 42.9. The convenient syntax of Octave requires very few changes; 
the extra lines enforce the constraints that the self-weights $w_{ii}$ should all be
zero and that the weight matrix should be symmetrical ($w_{ij} = w_{ji}$).

As expected, this learning algorithm does a better job than the one-shot 
Hebbian learning rule. When the six patterns of figure 42.5, which cannot be 
memorized by the Hebb rule, are learned using algorithm 42.9, all six patterns 
become stable states. 
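
For readers without Octave, the steps of algorithm 42.9 can be transcribed into plain Python. The sketch below is an illustrative port, not the author's code: the names `train_hopfield` and `is_stable`, the default values of `loops`, `eta` and `alpha`, and the final zeroing of the diagonal after the last step are all assumptions made here.

```python
import math


def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def train_hopfield(x, loops=100, eta=0.05, alpha=0.01):
    """Port of algorithm 42.9. x is a list of N patterns with entries in {-1, +1}."""
    rows, cols = range(len(x)), range(len(x[0]))
    t = [[1 if v == 1 else 0 for v in row] for row in x]      # targets in {0, 1}
    # Hebb-rule initialization, w = x' * x
    w = [[float(sum(x[n][i] * x[n][j] for n in rows)) for j in cols] for i in cols]
    for _ in range(loops):
        for i in cols:
            w[i][i] = 0.0                                     # self-weights are zero
        # errors e = t - sigmoid(x * w)
        e = [[t[n][i] - sigmoid(sum(x[n][j] * w[j][i] for j in cols)) for i in cols]
             for n in rows]
        gw = [[sum(x[n][i] * e[n][j] for n in rows) for j in cols] for i in cols]
        gw = [[gw[i][j] + gw[j][i] for j in cols] for i in cols]   # symmetrize
        w = [[w[i][j] + eta * (gw[i][j] - alpha * w[i][j]) for j in cols] for i in cols]
    for i in cols:
        w[i][i] = 0.0          # assumption: clear the diagonal once more at the end
    return w


def is_stable(w, pattern):
    """True if no neuron wants to change state when the network is set to pattern."""
    for i in range(len(pattern)):
        a = sum(w[i][j] * pattern[j] for j in range(len(pattern)))
        if (1 if a >= 0 else -1) != pattern[i]:
            return False
    return True
```

A typical use would be `w = train_hopfield(patterns)` followed by `all(is_stable(w, p) for p in patterns)` to check whether every desired memory has become a stable state.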


Exercise 42.8.1401] Implement this learning rule and investigate empirically its 
capacity for memorizing random patterns; also compare its avalanche 
properties with those of the Hebb rule. 


»> 42.9 Hopfield networks for optimization problems 


Since a Hopfield network’s dynamics minimize an energy function, it is natural 
to ask whether we can map interesting optimization problems onto Hopfield 
networks. Biological data processing problems often involve an element of 
constraint satisfaction — in scene interpretation, for example, one might wish 
to infer the spatial location, orientation, brightness and texture of each visible 
element, and which visible elements are connected together in objects. These 
inferences are constrained by the given data and by prior knowledge about 
continuity of objects. 

Figure 42.10. Hopfield network for

Hopfield and Tank (1985) suggested that one might take an interesting
constraint satisfaction problem and design the weights of a binary or contin- 
uous Hopfield network such that the settling process of the network would 
minimize the objective function of the problem. 


The travelling salesman problem 


A classic constraint satisfaction problem to which Hopfield networks have been 
applied is the travelling salesman problem. 

A set of K cities is given, and a matrix of the K(K−1)/2 distances between
those cities. The task is to find a closed tour of the cities, visiting each city 
once, that has the smallest total distance. The travelling salesman problem is 
equivalent in difficulty to an NP-complete problem. 

The method suggested by Hopfield and Tank is to represent a tentative
solution to the problem by the state of a network with $I = K^2$ neurons arranged
in a square, with each neuron representing the hypothesis that a particular 
city comes at a particular point in the tour. It will be convenient to consider 
the states of the neurons as being between 0 and 1 rather than —1 and 1. 
Two solution states for a four-city travelling salesman problem are shown in 
figure 42.10a. 

The weights in the Hopfield network play two roles. First, they must define 
an energy function which is minimized only when the state of the network 
represents a valid tour. A valid state is one that looks like a permutation 
matrix, having exactly one ‘1’ in every row and one ‘1’ in every column. This
rule can be enforced by putting large negative weights between any pair of 
neurons that are in the same row or the same column, and setting a positive 
bias for all neurons to ensure that K neurons do turn on. Figure 42.10b shows 
the negative weights that are connected to one neuron, ‘B2’, which represents 
the statement ‘city B comes second in the tour’. 

Second, the weights must encode the objective function that we want 
to minimize — the total distance. This can be done by putting negative 
weights proportional to the appropriate distances between the nodes in adja- 
cent columns. For example, between the B and D nodes in adjacent columns, 
the weight would be $-d_{BD}$. The negative weights that are connected to
neuron B2 are shown in figure 42.10c. The result is that when the network is in
a valid state, its total energy will be the total distance of the corresponding
tour, plus a constant given by the energy associated with the biases.

Figure 42.11. (a) Evolution of the state of a continuous Hopfield network
solving a travelling salesman problem using Aiyer’s (1991) graduated
non-convexity method; the state of the network is projected into the
two-dimensional space in which the cities are located by finding the centre
of mass for each point in the tour, using the neuron activities as the mass
function. (b) The travelling scholar problem. The shortest tour linking the
27 Cambridge Colleges, the Engineering Department, the University Library,
and Sree Aiyer’s house. From Aiyer (1991).
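
The weight construction described above can be made concrete with a small sketch. The Python code below builds the energy of a K × K grid of 0/1 neurons, assuming an energy of the form $E = -\frac{1}{2}\sum w\,x\,x - \sum b\,x$; the function names and the particular values of `penalty` and `bias` are hypothetical choices for illustration, not values from the text. For a valid tour the energy equals the tour length minus K times the bias.

```python
import math


def tour_to_state(tour):
    """state[c][p] = 1 if city c is visited at position p of the tour, else 0."""
    K = len(tour)
    return [[1 if tour[p] == c else 0 for p in range(K)] for c in range(K)]


def tsp_energy(state, dist, penalty=10.0, bias=1.0):
    """Energy E = -1/2 sum_{a != b} w_ab x_a x_b - bias * sum_a x_a (needs K >= 3).

    Weights: -penalty between neurons sharing a row (city) or column (position),
    and -dist[c][c2] between different cities in adjacent columns (cyclically).
    """
    K = len(dist)
    energy = 0.0
    for c in range(K):
        for p in range(K):
            if not state[c][p]:
                continue
            energy -= bias
            for c2 in range(K):
                for p2 in range(K):
                    if (c2, p2) == (c, p) or not state[c2][p2]:
                        continue
                    w = 0.0
                    if c2 == c or p2 == p:
                        w -= penalty                  # validity-enforcing weight
                    if c2 != c and (p2 - p) % K in (1, K - 1):
                        w -= dist[c][c2]              # distance weight
                    energy -= 0.5 * w
    return energy


if __name__ == "__main__":
    cities = [(0, 0), (1, 0), (1, 1), (0, 1)]         # corners of a unit square
    dist = [[math.dist(a, b) for b in cities] for a in cities]
    for tour in ([0, 1, 2, 3], [0, 2, 1, 3]):
        print(tour, tsp_energy(tour_to_state(tour), dist))
```

Raising `penalty` relative to the distances makes invalid states more costly, which is exactly the scale factor whose choice is discussed in the following paragraph.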


Now, since a Hopfield network minimizes its energy, it is hoped that the 
binary or continuous Hopfield network’s dynamics will take the state to a 
minimum that is a valid tour and which might be an optimal tour. This hope 
is not fulfilled for large travelling salesman problems, however, without some 
careful modifications. We have not specified the size of the weights that enforce 
the tour’s validity, relative to the size of the distance weights, and setting this 
scale factor poses difficulties. If ‘large’ validity-enforcing weights are used, 
the network’s dynamics will rattle into a valid state with little regard for the 
distances. If ‘small’ validity-enforcing weights are used, it is possible that the 
distance weights will cause the network to adopt an invalid state that has lower 
energy than any valid state. Our original formulation of the energy function 
puts the objective function and the solution’s validity in potential conflict 
with each other. This difficulty has been resolved by the work of Sree Aiyer 
(1991), who showed how to modify the distance weights so that they would not 
interfere with the solution’s validity, and how to define a continuous Hopfield 
network whose dynamics are at all times confined to a ‘valid subspace’. Aiyer 
used a graduated non-convexity or deterministic annealing approach to find 
good solutions using these Hopfield networks. The deterministic annealing 
approach involves gradually increasing the gain $\beta$ of the neurons in the network
from 0 to $\infty$, at which point the state of the network corresponds to a valid
tour. A sequence of trajectories generated by applying this method to a thirty- 
city travelling salesman problem is shown in figure 42.11a. 


A solution to the ‘travelling scholar problem’ found by Aiyer using a con- 
tinuous Hopfield network is shown in figure 42.11b. 

> 42.10 Further exercises


> Exercise 42.9.1] Storing two memories. 


Two binary memories m and n ($m_i, n_i \in \{-1, +1\}$) are stored by
Hebbian learning in a Hopfield network using

$$ w_{ij} = \begin{cases} m_i m_j + n_i n_j & \text{for } i \neq j \\ 0 & \text{for } i = j. \end{cases} \tag{42.31} $$


The biases b; are set to zero. 
The network is put in the state x = m. Evaluate the activation $a_i$ of
neuron i and show that it can be written in the form

$$ a_i = \mu m_i + \nu n_i. \tag{42.32} $$

By comparing the signal strength, $\mu$, with the magnitude of the noise
strength, $|\nu|$, show that x = m is a stable state of the dynamics of the
network.


The network is put in a state x differing in D places from m,

$$ x = m + 2d, \tag{42.33} $$

where the perturbation d satisfies $d_i \in \{-1, 0, +1\}$. D is the number
of components of d that are non-zero, and for each $d_i$ that is non-zero,
$d_i = -m_i$. Defining the overlap between m and n to be

$$ o_{mn} = \sum_{i=1}^{I} m_i n_i, \tag{42.34} $$

evaluate the activation $a_i$ of neuron i again and show that the dynamics
of the network will restore x to m if the number of flipped bits satisfies

$$ D < \frac{1}{4} \left( I - |o_{mn}| - 2 \right). \tag{42.35} $$


How does this number compare with the maximum number of flipped 
bits that can be corrected by the optimal decoder, assuming the vector 
x is either a noisy version of m or of n? 


Exercise 42.10.19 Hopfield network as a collection of binary classifiers. This ex- 
ercise explores the link between unsupervised networks and supervised 
networks. If a Hopfield network’s desired memories are all attracting 
stable states, then every neuron in the network has weights going to it 
that solve a classification problem personal to that neuron. Take the set 
of memories and write them in the form $\mathbf{x}'^{(n)}, x_i^{(n)}$, where $\mathbf{x}'$ denotes all
the components $x_{i'}$ for all $i' \neq i$, and let $\mathbf{w}'$ denote the vector of weights
$w_{ii'}$, for $i' \neq i$.

Using what we know about the capacity of the single neuron, show that
it is almost certainly impossible to store more than 2I random memories
in a Hopfield network of I neurons.

Lyapunov functions


Exercise 42.11.19] Erik's puzzle. In a stripped-down version of Conway’s game 
of life, cells are arranged on a square grid. Each cell is either alive or 
dead. Live cells do not die. Dead cells become alive if two or more of 
their immediate neighbours are alive. (Neighbours to north, south, east 
and west.) What is the smallest number of live cells needed in order 
that these rules lead to an entire N x N square being alive? 









































In a d-dimensional version of the same game, the rule is that if d neighbours
are alive then you come to life. What is the smallest number of
live cells needed in order that an entire N × N × ··· × N hypercube
becomes alive? (And how should those live cells be arranged?)


The southeast puzzle 


Figure 42.13. The southeast puzzle.


The southeast puzzle is played on a semi-infinite chess board, starting at
its northwest (top left) corner. There are three rules:





1. In the starting position, one piece is placed in the northwest-most square 
(figure 42.13a). 


2. It is not permitted for more than one piece to be on any given square. 


3. At each step, you remove one piece from the board, and replace it with
two pieces, one in the square immediately to the east, and one in the
square immediately to the south, as illustrated in figure 42.13b. Every
such step increases the number of pieces on the board by one.


After move (b) has been made, either piece may be selected for the next move. 
Figure 42.13c shows the outcome of moving the lower piece. At the next move, 
either the lowest piece or the middle piece of the three may be selected; the 
uppermost piece may not be selected, since that would violate rule 2. At move 
(d) we have selected the middle piece. Now any of the pieces may be moved, 
except for the leftmost piece. 

Now, here is the puzzle:


> Exercise 42.12.[4, p.521] Is it possible to obtain a position in which all the ten
squares closest to the northwest corner, marked in figure 42.13z, are 
empty? 


[Hint: this puzzle has a connection to data compression.]


> 42.11 Solutions 


Solution to exercise 42.3 (p.508). Take a binary feedback network with 2 neurons
and let $w_{12} = 1$ and $w_{21} = -1$. Then whenever neuron 1 is updated,
it will match neuron 2, and whenever neuron 2 is updated, it will flip to the
opposite state from neuron 1. There is no stable state.

Solution to exercise 42.4 (p.508). Take a binary Hopfield network with 2 neurons
and let $w_{12} = w_{21} = 1$, and let the initial condition be $x_1 = 1$, $x_2 = -1$.
Then if the dynamics are synchronous, on every iteration both neurons will 
flip their state. The dynamics do not converge to a fixed point. 
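
Both counterexamples are small enough to check by brute force. The Python sketch below enumerates the four states of each two-neuron network; the helper names are illustrative, and the threshold convention (output +1 when the activation is at least zero) is an assumption, although the activations here are never exactly zero.

```python
from itertools import product


def threshold(a):
    """Binary threshold activity rule: +1 if a >= 0, else -1."""
    return 1 if a >= 0 else -1


def activations(W, x):
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(x))]


def is_stable(W, x):
    """A state is stable if no single-neuron update would change it."""
    return all(threshold(a) == xi for a, xi in zip(activations(W, x), x))


def synchronous_step(W, x):
    """Update every neuron at once, using the old state for all activations."""
    return tuple(threshold(a) for a in activations(W, x))


W_asymmetric = [[0, 1], [-1, 0]]     # exercise 42.3: w12 = 1, w21 = -1
W_symmetric = [[0, 1], [1, 0]]       # exercise 42.4: w12 = w21 = 1

if __name__ == "__main__":
    stable = [x for x in product((-1, 1), repeat=2) if is_stable(W_asymmetric, x)]
    print("stable states of the asymmetric network:", stable)

    x = (1, -1)
    for step in range(4):
        print("synchronous step", step, x)
        x = synchronous_step(W_symmetric, x)
```

The asymmetric network has no stable state at all, while the symmetric network under synchronous updates alternates between (1, −1) and (−1, 1) forever.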


Solution to exercise 42.12 (p.520). The key to this problem is to notice its 
similarity to the construction of a binary symbol code. Starting from the 
empty string, we can build a binary tree by repeatedly splitting a codeword 
into two. Every codeword has an implicit probability $2^{-l}$, where $l$ is the
depth of the codeword in the binary tree. Whenever we split a codeword in
two and create two new codewords whose length is increased by one, the two 
new codewords each have implicit probability equal to half that of the old 
codeword. For a complete binary code, the Kraft equality affirms that the 
sum of these implicit probabilities is 1. 

Similarly, in southeast, we can associate a ‘weight’ with each piece on the 
board. If we assign a weight of 1 to any piece sitting on the top left square; 
a weight of 1/2 to any piece on a square whose distance from the top left is 
one; a weight of 1/4 to any piece whose distance from the top left is two; and 
so forth, with ‘distance’ being the city-block distance; then every legal move 
in southeast leaves unchanged the total weight of all pieces on the board. 
Lyapunov functions come in two flavours: the function may be a function of 
state whose value is known to stay constant; or it may be a function of state 
that is bounded below, and whose value always decreases or stays constant. 
The total weight is a Lyapunov function of the first type: its value stays constant.

The starting weight is 1, so now we have a powerful tool: a conserved
function of the state. Is it possible to find a position in which the ten
highest-weight squares are vacant, and the total weight is 1? What is the total weight
if all the other squares on the board are occupied (figure 42.14)? The total
weight would be $\sum_{l=4}^{\infty} (l+1)\, 2^{-l}$, which is equal to 3/4. So it is impossible to
empty all ten of those squares.

Figure 42.14. A possible position for the southeast puzzle?
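
The conservation argument can be checked with exact rational arithmetic. The following Python sketch is illustrative only: the board representation as a set of (row, column) pairs and the function names are assumptions made here, not part of the original solution.

```python
from fractions import Fraction


def weight(square):
    """Weight 2^-(city-block distance from the northwest corner)."""
    row, col = square
    return Fraction(1, 2 ** (row + col))


def total_weight(board):
    return sum((weight(sq) for sq in board), Fraction(0))


def move(board, square):
    """Replace the piece on `square` by pieces to its east and south (rule 3)."""
    row, col = square
    east, south = (row, col + 1), (row + 1, col)
    if square not in board:
        raise ValueError("no piece on that square")
    if east in board or south in board:
        raise ValueError("move would put two pieces on one square (rule 2)")
    return (board - {square}) | {east, south}


def weight_outside_corner(max_distance):
    """Partial sum of (l + 1) 2^-l for l = 4 .. max_distance."""
    return sum((Fraction(l + 1, 2 ** l) for l in range(4, max_distance + 1)), Fraction(0))


if __name__ == "__main__":
    board = {(0, 0)}
    for square in [(0, 0), (1, 0), (1, 1)]:
        board = move(board, square)
        print(sorted(board), "total weight =", total_weight(board))
    print("weight available outside the corner:", float(weight_outside_corner(200)))
```

Every legal move leaves the total weight at exactly 1, while any finite set of pieces outside the ten corner squares weighs strictly less than 3/4.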

43


Boltzmann Machines 


> 43.1 From Hopfield networks to Boltzmann machines 


We have noticed that the binary Hopfield network minimizes an energy
function

$$ E(\mathbf{x}) = -\tfrac{1}{2}\, \mathbf{x}^{\mathsf{T}} \mathbf{W} \mathbf{x} \tag{43.1} $$

and that the continuous Hopfield network with activation function $x_n =
\tanh(a_n)$ can be viewed as approximating the probability distribution
associated with that energy function,

$$ P(\mathbf{x} \,|\, \mathbf{W}) = \frac{1}{Z(\mathbf{W})} \exp[-E(\mathbf{x})] = \frac{1}{Z(\mathbf{W})} \exp\left[ \tfrac{1}{2}\, \mathbf{x}^{\mathsf{T}} \mathbf{W} \mathbf{x} \right]. \tag{43.2} $$

These observations motivate the idea of working with a neural network model
that actually implements the above probability distribution.
The stochastic Hopfield network or Boltzmann machine (Hinton and Se- 
jnowski, 1986) has the following activity rule: 


Activity rule of Boltzmann machine: after computing the activation
$a_i$ (42.3),

set $x_i = +1$ with probability

$$ \frac{1}{1 + e^{-2 a_i}} \tag{43.3} $$

else set $x_i = -1$.





This rule implements Gibbs sampling for the probability distribution (43.2). 
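
The claim that the activity rule is Gibbs sampling can be checked on a tiny network by comparing equation (43.3) with the exact conditional probability derived from (43.2). This is a minimal Python sketch; the function names are illustrative, and it assumes zero biases, a symmetric weight matrix with zero diagonal, and an activation $a_i = \sum_j w_{ij} x_j$.

```python
import math
import random


def activation(W, x, i):
    """a_i = sum_j w_ij x_j, with no self-connection."""
    return sum(W[i][j] * x[j] for j in range(len(x)) if j != i)


def prob_plus(a):
    """Probability of setting x_i = +1, equation (43.3)."""
    return 1.0 / (1.0 + math.exp(-2.0 * a))


def gibbs_sweep(W, x, rng):
    """Apply the stochastic activity rule once to every neuron in turn."""
    for i in range(len(x)):
        x[i] = 1 if rng.random() < prob_plus(activation(W, x, i)) else -1
    return x


def exact_conditional(W, x, i):
    """P(x_i = +1 | all other x_j) computed directly from P(x|W) of (43.2)."""
    def unnormalized(state):
        n = len(state)
        s = sum(W[a][b] * state[a] * state[b] for a in range(n) for b in range(n) if a != b)
        return math.exp(0.5 * s)
    up, down = list(x), list(x)
    up[i], down[i] = 1, -1
    p_up, p_down = unnormalized(up), unnormalized(down)
    return p_up / (p_up + p_down)


if __name__ == "__main__":
    W = [[0.0, 0.8, -0.3], [0.8, 0.0, 0.5], [-0.3, 0.5, 0.0]]
    rng = random.Random(0)
    x = [1, -1, 1]
    for _ in range(5):
        print(gibbs_sweep(W, x, rng))
```

The two ways of computing the conditional probability agree because flipping $x_i$ changes $\frac{1}{2}\mathbf{x}^{\mathsf{T}}\mathbf{W}\mathbf{x}$ by exactly $2 a_i$.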


Boltzmann machine learning 


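Given a set of examples $\{\mathbf{x}^{(n)}\}_1^N$ from the real world, we might be interested
in adjusting the weights W such that the generative model

$$ P(\mathbf{x} \,|\, \mathbf{W}) = \frac{1}{Z(\mathbf{W})} \exp\left[ \tfrac{1}{2}\, \mathbf{x}^{\mathsf{T}} \mathbf{W} \mathbf{x} \right] \tag{43.4} $$
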

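is well matched to those examples. We can derive a learning algorithm by
writing down Bayes’ theorem to obtain the posterior probability of the weights
given the data:

$$ P(\mathbf{W} \,|\, \{\mathbf{x}^{(n)}\}_1^N) = \frac{\left[ \prod_{n=1}^{N} P(\mathbf{x}^{(n)} \,|\, \mathbf{W}) \right] P(\mathbf{W})}{P(\{\mathbf{x}^{(n)}\}_1^N)}. \tag{43.5} $$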

We concentrate on the first term in the numerator, the likelihood, and derive a
maximum likelihood algorithm (though there might be advantages in pursuing
a full Bayesian approach as we did in the case of the single neuron). We
differentiate the logarithm of the likelihood,

$$ \ln \left[ \prod_{n=1}^{N} P(\mathbf{x}^{(n)} \,|\, \mathbf{W}) \right] = \sum_{n=1}^{N} \left[ \tfrac{1}{2}\, \mathbf{x}^{(n)\mathsf{T}} \mathbf{W} \mathbf{x}^{(n)} - \ln Z(\mathbf{W}) \right], \tag{43.6} $$

with respect to $w_{ij}$, bearing in mind that W is defined to be symmetric with
$w_{ji} = w_{ij}$.

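Exercise 43.1.[2] Show that the derivative of ln Z(W) with respect to $w_{ij}$ is

$$ \frac{\partial}{\partial w_{ij}} \ln Z(\mathbf{W}) = \sum_{\mathbf{x}} x_i x_j P(\mathbf{x} \,|\, \mathbf{W}) = \langle x_i x_j \rangle_{P(\mathbf{x} | \mathbf{W})}. \tag{43.7} $$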


[This exercise is similar to exercise 22.12 (p.307).] 


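The derivative of the log likelihood is therefore:

$$ \frac{\partial}{\partial w_{ij}} \ln P(\{\mathbf{x}^{(n)}\}_1^N \,|\, \mathbf{W}) = \sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)} - \langle x_i x_j \rangle_{P(\mathbf{x} | \mathbf{W})} \right] \tag{43.8} $$

$$ \phantom{\frac{\partial}{\partial w_{ij}}} = N \left[ \langle x_i x_j \rangle_{\text{Data}} - \langle x_i x_j \rangle_{P(\mathbf{x} | \mathbf{W})} \right]. \tag{43.9} $$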





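This gradient is proportional to the difference of two terms. The first term is
the empirical correlation between $x_i$ and $x_j$,

$$ \langle x_i x_j \rangle_{\text{Data}} \equiv \frac{1}{N} \sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)} \right], \tag{43.10} $$

and the second term is the correlation between $x_i$ and $x_j$ under the current
model,

$$ \langle x_i x_j \rangle_{P(\mathbf{x} | \mathbf{W})} \equiv \sum_{\mathbf{x}} x_i x_j P(\mathbf{x} \,|\, \mathbf{W}). \tag{43.11} $$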


The first correlation $\langle x_i x_j \rangle_{\text{Data}}$ is readily evaluated: it is just the empirical
correlation between the activities in the real world. The second correlation,
$\langle x_i x_j \rangle_{P(\mathbf{x} | \mathbf{W})}$, is not so easy to evaluate, but it can be estimated by Monte
Carlo methods, that is, by observing the average value of $x_i x_j$ while the
activity rule of the Boltzmann machine, equation (43.3), is iterated.

In the special case W = 0, we can evaluate the gradient exactly because,
by symmetry, the correlation $\langle x_i x_j \rangle_{P(\mathbf{x} | \mathbf{W})}$ must be zero. If the weights are
adjusted by gradient descent with learning rate $\eta$, then, after one iteration,
the weights will be

$$ w_{ij} = \eta \sum_{n=1}^{N} \left[ x_i^{(n)} x_j^{(n)} \right], \tag{43.12} $$



precisely the value of the weights given by the Hebb rule, equation (16.5), with 
which we trained the Hopfield network. 
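
For a network small enough to enumerate, the gradient (43.9) can be computed exactly rather than by Monte Carlo, which makes the W = 0 special case easy to verify. The Python sketch below is illustrative; the function names are assumptions, and exhaustive enumeration over all $2^I$ states is only feasible for very small I.

```python
import math
from itertools import product


def model_correlations(W):
    """<x_i x_j> under P(x|W) of (43.4), by enumerating all 2^I states."""
    n = len(W)
    states = list(product((-1, 1), repeat=n))
    unnorm = [math.exp(0.5 * sum(W[i][j] * x[i] * x[j] for i in range(n) for j in range(n)))
              for x in states]
    Z = sum(unnorm)
    return [[sum(p * x[i] * x[j] for p, x in zip(unnorm, states)) / Z for j in range(n)]
            for i in range(n)]


def data_correlations(data):
    """Empirical correlations <x_i x_j>_Data, equation (43.10)."""
    n = len(data[0])
    return [[sum(x[i] * x[j] for x in data) / len(data) for j in range(n)] for i in range(n)]


def gradient(W, data):
    """Gradient of the log likelihood with respect to w_ij, equation (43.9)."""
    cd, cm = data_correlations(data), model_correlations(W)
    n = len(W)
    return [[len(data) * (cd[i][j] - cm[i][j]) for j in range(n)] for i in range(n)]


def one_step_from_zero(data, eta):
    """One gradient step of size eta starting from W = 0."""
    n = len(data[0])
    g = gradient([[0.0] * n for _ in range(n)], data)
    return [[eta * g[i][j] for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    data = [(1, 1, -1), (1, -1, -1)]
    for row in one_step_from_zero(data, eta=0.1):
        print(row)
```

The off-diagonal entries printed are η times the Hebbian sums $\sum_n x_i^{(n)} x_j^{(n)}$, as equation (43.12) states; the diagonal entries of the gradient vanish because $x_i^2 = 1$ under both the data and the model.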


Interpretation of Boltzmann machine learning 


One way of viewing the two terms in the gradient (43.9) is as ‘waking’ and
‘sleeping’ rules. While the network is ‘awake’, it measures the correlation
between $x_i$ and $x_j$ in the real world, and weights are increased in proportion.
While the network is ‘asleep’, it ‘dreams’ about the world using the generative
model (43.4), and measures the correlations between $x_i$ and $x_j$ in the model
world; these correlations determine a proportional decrease in the weights. If
the second-order correlations in the dream world match the correlations in the
real world, then the two terms balance and the weights do not change.


Criticism of Hopfield networks and simple Boltzmann machines


Up to this point we have discussed Hopfield networks and Boltzmann machines
in which all of the neurons correspond to visible variables $x_i$. The result
is a probabilistic model that, when optimized, can capture the second-order
statistics of the environment. [The second-order statistics of an ensemble
P(x) are the expected values $\langle x_i x_j \rangle$ of all the pairwise products $x_i x_j$.] The
real world, however, often has higher-order correlations that must be included
if our description of it is to be effective. Often the second-order correlations
in themselves may carry little or no useful information.

Figure 43.1. The ‘shifter’ ensembles. (a) Four samples from the plain shifter
ensemble. (b) Four corresponding samples from the labelled shifter ensemble.

Consider, for example, the ensemble of binary images of chairs. We can
imagine images of chairs with various designs: four-legged chairs, comfy
chairs, chairs with five legs and wheels, wooden chairs, cushioned chairs, chairs
with rockers instead of legs. A child can easily learn to distinguish these images
from images of carrots and parrots. But I expect the second-order statistics of
the raw data are useless for describing the ensemble. Second-order statistics
only capture whether two pixels are likely to be in the same state as each
other. Higher-order concepts are needed to make a good generative model of
images of chairs.

A simpler ensemble of images in which high-order statistics are important
is the ‘shifter ensemble’, which comes in two flavours. Figure 43.1a shows a
few samples from the ‘plain shifter ensemble’. In each image, the bottom eight
pixels are a copy of the top eight pixels, either shifted one pixel to the left,
or unshifted, or shifted one pixel to the right. (The top eight pixels are set
at random.) This ensemble is a simple model of the visual signals from the
two eyes arriving at early levels of the brain. The signals from the two eyes
are similar to each other but may differ by small translations because of the
varying depth of the visual world. This ensemble is simple to describe, but its
second-order statistics convey no useful information. The correlation between
one pixel and any of the three pixels above it is 1/3. The correlation between
any other two pixels is zero.

Figure 43.1b shows a few samples from the ‘labelled shifter ensemble’.
Here, the problem has been made easier by including an extra three
neurons that label the visual image as being an instance of either the ‘shift left’,
‘no shift’, or ‘shift right’ sub-ensemble. But with this extra information, the
ensemble is still not learnable using second-order statistics alone. The
second-order correlation between any label neuron and any image neuron is zero. We
need models that can capture higher-order statistics of an environment.

So, how can we develop such models? One idea might be to create models
that directly capture higher-order correlations, such as:

$$ P'(\mathbf{x} \,|\, \mathbf{W}, \mathbf{V}, \ldots) = \frac{1}{Z'} \exp\left( \frac{1}{2} \sum_{ij} w_{ij} x_i x_j + \frac{1}{6} \sum_{ijk} v_{ijk} x_i x_j x_k + \cdots \right). \tag{43.13} $$

Such higher-order Boltzmann machines are equally easy to simulate using
stochastic updates, and the learning rule for the higher-order parameters $v_{ijk}$
is equivalent to the learning rule for $w_{ij}$.
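
The second-order statistics quoted above for the plain and labelled shifter ensembles can be confirmed by exhaustive enumeration. The Python sketch below is illustrative and rests on assumptions not stated in the text: pixels take values ±1, shifts wrap around cyclically, the three shifts are equally probable, and each label neuron is +1 for its own shift and −1 otherwise.

```python
from fractions import Fraction
from itertools import product

WIDTH = 8
SHIFTS = (-1, 0, 1)          # shift left, no shift, shift right


def shifter_samples():
    """Yield every (top, bottom, label) triple of the labelled shifter ensemble."""
    for top in product((-1, 1), repeat=WIDTH):
        for shift in SHIFTS:
            bottom = tuple(top[(i + shift) % WIDTH] for i in range(WIDTH))
            label = tuple(1 if shift == s else -1 for s in SHIFTS)
            yield top, bottom, label


SAMPLES = list(shifter_samples())


def expectation(f):
    """Exact ensemble average of f(top, bottom, label)."""
    return Fraction(sum(f(t, b, l) for t, b, l in SAMPLES), len(SAMPLES))


if __name__ == "__main__":
    print("bottom pixel 3 with top pixels 2, 3, 4:",
          [expectation(lambda t, b, l, k=k: b[3] * t[k]) for k in (2, 3, 4)])
    print("bottom pixel 3 with top pixel 6:", expectation(lambda t, b, l: b[3] * t[6]))
    print("label neuron with an image pixel:", expectation(lambda t, b, l: l[0] * b[3]))
```

The only non-zero image correlations are the values of 1/3 between a bottom pixel and the three pixels above it, and these are identical for all three shifts, so they cannot reveal which shift was applied.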

> Exercise 43.2.[2] Derive the gradient of the log likelihood with respect to $v_{ijk}$.


It is possible that the spines found on biological neurons are responsible for 
detecting correlations between small numbers of incoming signals. However, 
to capture statistics of high enough order to describe the ensemble of images 
of chairs well would require an unimaginable number of terms. To capture 
merely the fourth-order statistics in a 128 × 128 pixel image, we need more
than $10^7$ parameters.

So measuring moments of images is not a good way to describe their un- 
derlying structure. Perhaps what we need instead or in addition are hidden 
variables, also known to statisticians as latent variables. This is the important 
innovation introduced by Hinton and Sejnowski (1986). The idea is that the 
high-order correlations among the visible variables are described by includ- 
ing extra hidden variables and sticking to a model that has only second-order 
interactions between its variables; the hidden variables induce higher-order 
correlations between the visible variables. 


> 43.2 Boltzmann machine with hidden units 


We now add hidden neurons to our stochastic model. These are neurons that 
do not correspond to observed variables; they are free to play any role in the 
probabilistic model defined by equation (43.4). They might actually take on 
interpretable roles, effectively performing ‘feature extraction’. 


Learning in Boltzmann machines with hidden units 


The activity rule of a Boltzmann machine with hidden units is identical to that 
of the original Boltzmann machine. The learning rule can again be derived 
by maximum likelihood, but now we need to take into account the fact that 
the states of the hidden units are unknown. We will denote the states of the 
visible units by x, the states of the hidden units by h, and the generic state 
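of a neuron (either visible or hidden) by $y_i$, with $\mathbf{y} \equiv (\mathbf{x}, \mathbf{h})$. The state of the
network when the visible neurons are clamped in state $\mathbf{x}^{(n)}$ is $\mathbf{y}^{(n)} \equiv (\mathbf{x}^{(n)}, \mathbf{h})$.
The likelihood of W given a single data example $\mathbf{x}^{(n)}$ is

$$ P(\mathbf{x}^{(n)} \,|\, \mathbf{W}) = \sum_{\mathbf{h}} P(\mathbf{x}^{(n)}, \mathbf{h} \,|\, \mathbf{W}) = \sum_{\mathbf{h}} \frac{1}{Z(\mathbf{W})} \exp\left[ \tfrac{1}{2}\, \mathbf{y}^{(n)\mathsf{T}} \mathbf{W} \mathbf{y}^{(n)} \right], \tag{43.14} $$

where

$$ Z(\mathbf{W}) = \sum_{\mathbf{x}, \mathbf{h}} \exp\left[ \tfrac{1}{2}\, \mathbf{y}^{\mathsf{T}} \mathbf{W} \mathbf{y} \right]. \tag{43.15} $$

Equation (43.14) may also be written

$$ P(\mathbf{x}^{(n)} \,|\, \mathbf{W}) = \frac{Z_{\mathbf{x}^{(n)}}(\mathbf{W})}{Z(\mathbf{W})} \tag{43.16} $$

where

$$ Z_{\mathbf{x}^{(n)}}(\mathbf{W}) = \sum_{\mathbf{h}} \exp\left[ \tfrac{1}{2}\, \mathbf{y}^{(n)\mathsf{T}} \mathbf{W} \mathbf{y}^{(n)} \right]. \tag{43.17} $$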

Differentiating the likelihood as before, we find that the derivative with 
respect to any weight $w_{ij}$ is again the difference between a ‘waking’ term and a 
‘sleeping’ term, 

\[ \frac{\partial}{\partial w_{ij}} \ln P(\{\mathbf{x}^{(n)}\}_1^N \mid \mathbf{W}) = \sum_n \left\{ \langle y_i y_j \rangle_{P(\mathbf{h} \mid \mathbf{x}^{(n)}, \mathbf{W})} - \langle y_i y_j \rangle_{P(\mathbf{x}, \mathbf{h} \mid \mathbf{W})} \right\}. \tag{43.18} \]

The first term $\langle y_i y_j \rangle_{P(\mathbf{h} \mid \mathbf{x}^{(n)}, \mathbf{W})}$ is the correlation between $y_i$ and $y_j$ if the 
Boltzmann machine is simulated with the visible variables clamped to $\mathbf{x}^{(n)}$ 
and the hidden variables freely sampling from their conditional distribution. 

The second term $\langle y_i y_j \rangle_{P(\mathbf{x}, \mathbf{h} \mid \mathbf{W})}$ is the correlation between $y_i$ and $y_j$ when 
the Boltzmann machine generates samples from its model distribution. 
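
For a very small machine the two correlations in (43.18) can be computed exactly by enumerating states, which makes the waking and sleeping terms concrete. The sketch below is only an illustration under stated assumptions: states are taken to be ±1, W is symmetric with zero diagonal (so $w_{ij}$ and $w_{ji}$ are one parameter), and the function names `all_states`, `correlations`, `gradient` and `log_likelihood` are hypothetical helpers, not anything defined in the text.

```python
import itertools
import numpy as np

def all_states(k):
    """All 2**k configurations of k units taking values -1 or +1."""
    return np.array(list(itertools.product([-1, 1], repeat=k)), dtype=float)

def unnormalized(Y, W):
    """exp[(1/2) y^T W y] for each row y of Y."""
    return np.exp(0.5 * np.einsum('si,ij,sj->s', Y, W, Y))

def correlations(Y, W):
    """<y_i y_j> under the Boltzmann distribution restricted to the rows of Y."""
    p = unnormalized(Y, W)
    p /= p.sum()
    return np.einsum('s,si,sj->ij', p, Y, Y)

def clamped_states(x, n_hidden):
    """All states y = (x, h) with the visible units clamped to x."""
    H = all_states(n_hidden)
    return np.hstack([np.tile(x, (len(H), 1)), H])

def gradient(W, data, n_hidden):
    """Sum over examples of (waking correlation - sleeping correlation)."""
    free = correlations(all_states(data.shape[1] + n_hidden), W)
    grad = np.zeros_like(W)
    for x in data:
        grad += correlations(clamped_states(x, n_hidden), W) - free
    np.fill_diagonal(grad, 0.0)  # no self-connections
    return grad

def log_likelihood(W, data, n_hidden):
    """Sum over examples of ln Z_x(W) - ln Z(W), as in (43.16)."""
    log_Z = np.log(unnormalized(all_states(data.shape[1] + n_hidden), W).sum())
    return sum(np.log(unnormalized(clamped_states(x, n_hidden), W).sum()) - log_Z
               for x in data)

# Tiny example: 3 visible units, 2 hidden units, two data vectors.
rng = np.random.default_rng(0)
A = rng.normal(scale=0.5, size=(5, 5))
W = np.triu(A, 1)
W = W + W.T                      # symmetric, zero diagonal
data = np.array([[1, -1, 1], [-1, -1, 1]], dtype=float)
g = gradient(W, data, n_hidden=2)
```

Enumeration costs $2^{\text{(number of units)}}$ operations, so it is only a check for toy sizes; the text's point is that realistic machines must estimate both terms by Monte Carlo.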

Hinton and Sejnowski demonstrated that non-trivial ensembles such as 
the labelled shifter ensemble can be learned using a Boltzmann machine with 
hidden units. The hidden units take on the role of feature detectors that spot 
patterns likely to be associated with one of the three shifts. 

The Boltzmann machine is time-consuming to simulate because the computation 
of the gradient of the log likelihood depends on taking the difference of 
two gradients, both found by Monte Carlo methods. So Boltzmann machines 
are not in widespread use. It is an area of active research to create models 
that embody the same capabilities using more efficient computations (Hinton 
et al., 1995; Dayan et al., 1995; Hinton and Ghahramani, 1997; Hinton, 2001; 
Hinton and Teh, 2001). 


> 43.3 Exercise 


> Exercise 43.3.[3] Can the ‘bars and stripes’ ensemble (figure 43.2) be learned 
by a Boltzmann machine with no hidden units? [You may be surprised!] 


Figure 43.2. Four samples from 
the ‘bars and stripes’ ensemble. 
Each sample is generated by first 
picking an orientation, horizontal 
or vertical; then, for each row of 
spins in that orientation (each bar 
or stripe respectively), switching 
all spins on with probability 1/2. 


44 


Supervised Learning in Multilayer 
Networks 


> 44.1 Multilayer perceptrons 


No course on neural networks could be complete without a discussion of su- 
pervised multilayer networks, also known as backpropagation networks. 

The multilayer perceptron is a feedforward network. It has input neurons, 
hidden neurons and output neurons. The hidden neurons may be arranged 
in a sequence of layers. The most common multilayer perceptrons have a 
single hidden layer, and are known as ‘two-layer’ networks, the number ‘two’ 
counting the number of layers of neurons not including the inputs. 

Such a feedforward network defines a nonlinear parameterized mapping 
from an input x to an output y = y(x;w, A). The output is a continuous 
function of the input and of the parameters w; the architecture of the net, i.e., 
the functional form of the mapping, is denoted by A. Feedforward networks 
can be ‘trained’ to perform regression and classification tasks. 


Regression networks 


In the case of a regression problem, the mapping for a network with one hidden 
layer may have the form: 

\[ \text{Hidden layer:} \quad a_j^{(1)} = \sum_l w_{jl}^{(1)} x_l + \theta_j^{(1)}; \qquad h_j = f^{(1)}(a_j^{(1)}) \tag{44.1} \]

\[ \text{Output layer:} \quad a_i^{(2)} = \sum_j w_{ij}^{(2)} h_j + \theta_i^{(2)}; \qquad y_i = f^{(2)}(a_i^{(2)}) \tag{44.2} \]

where, for example, $f^{(1)}(a) = \tanh(a)$, and $f^{(2)}(a) = a$. Here l runs over 
the inputs 1, ..., L, j runs over the hidden units, and i runs over the outputs. 
The ‘weights’ w and ‘biases’ θ together make up the parameter vector 
w. The nonlinear sigmoid function $f^{(1)}$ at the hidden layer gives the neural 
network greater computational flexibility than a standard linear regression 
model. Graphically, we can represent the neural network as a set of layers of 
connected neurons (figure 44.1). 
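
Equations (44.1) and (44.2) amount to two matrix-vector products with a nonlinearity in between. The following is a minimal sketch of that forward pass, assuming $f^{(1)} = \tanh$ and $f^{(2)}$ the identity as in the example above; the function name `forward` and the array names are illustrative choices, not notation from the text.

```python
import numpy as np

def forward(x, W1, theta1, W2, theta2):
    """Forward pass of a network with one hidden layer.

    x      : inputs, shape (L,)
    W1     : input-to-hidden weights w^(1)_{jl}, shape (H, L)
    theta1 : hidden biases theta^(1)_j, shape (H,)
    W2     : hidden-to-output weights w^(2)_{ij}, shape (O, H)
    theta2 : output biases theta^(2)_i, shape (O,)
    """
    a1 = W1 @ x + theta1      # hidden activations, equation (44.1)
    h = np.tanh(a1)           # f^(1) = tanh
    a2 = W2 @ h + theta2      # output activations, equation (44.2)
    return a2                 # f^(2)(a) = a for regression

# A network shaped like figure 44.1: six inputs, seven hidden units, three outputs.
rng = np.random.default_rng(0)
W1, theta1 = rng.normal(size=(7, 6)), rng.normal(size=7)
W2, theta2 = rng.normal(size=(3, 7)), rng.normal(size=3)
y = forward(rng.normal(size=6), W1, theta1, W2, theta2)
```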


What sorts of functions can these networks implement? 


Just as we explored the weight space of the single neuron in Chapter 39, 
examining the functions it could produce, let us explore the weight space of 
a multilayer network. In figures 44.2 and 44.3 I take a network with one 
input and one output and a large number H of hidden units, set the biases 
and weights $\theta_j^{(1)}$, $w_{jl}^{(1)}$, $\theta_i^{(2)}$ and $w_{ij}^{(2)}$ to random values, and plot the resulting 
function y(x). I set the hidden units’ biases $\theta_j^{(1)}$ to random values from a 
Gaussian with zero mean and standard deviation $\sigma_{\rm bias}$; the input-to-hidden 
weights $w_{jl}^{(1)}$ to random values with standard deviation $\sigma_{\rm in}$; and the bias and 
output weights $\theta_i^{(2)}$ and $w_{ij}^{(2)}$ to random values with standard deviation $\sigma_{\rm out}$. 

Figure 44.1. A typical two-layer 
network, with six inputs, seven 
hidden units, and three outputs. 
Each line represents one weight. 

Figure 44.2. Samples from the 
prior over functions of a one-input 
network. For each of a sequence of 
values of $\sigma_{\rm bias}$ = 8, 6, 4, 3, 2, 1.6, 
1.2, 0.8, 0.4, 0.3, 0.2, and 
$\sigma_{\rm in} = 5\sigma_{\rm bias}$, one random function 
is shown. The other 
hyperparameters of the network 
were H = 400, $\sigma_{\rm out}$ = 0.05. 

The sort of functions that we obtain depends on the values of $\sigma_{\rm bias}$, $\sigma_{\rm in}$ 
and $\sigma_{\rm out}$. As the weights and biases are made bigger we obtain more complex 
functions with more features and a greater sensitivity to the input variable. 
The vertical scale of a typical function produced by the network with random 
weights is of order $\sqrt{H}\,\sigma_{\rm out}$; the horizontal range in which the function varies 
significantly is of order $\sigma_{\rm bias}/\sigma_{\rm in}$; and the shortest horizontal length scale is of 
order $1/\sigma_{\rm in}$. 

Radford Neal (1996) has also shown that in the limit as $H \to \infty$ the 
statistical properties of the functions generated by randomizing the weights are 
independent of the number of hidden units; so, interestingly, the complexity of 
the functions becomes independent of the number of parameters in the model. 
What determines the complexity of the typical functions is the characteristic 
magnitude of the weights. Thus we anticipate that when we fit these models to 
real data, an important way of controlling the complexity of the fitted function 
will be to control the characteristic magnitude of the weights. 



Figure 44.4 shows one typical function produced by a network with two 
inputs and one output. This should be contrasted with the function produced 
by a traditional linear regression model, which is a flat plane. Neural networks 
can create functions with more complexity than a linear regression. 
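
The random functions described in this section are easy to generate numerically. The sketch below draws one function from the prior of a one-input, one-output network with Gaussian weights; the function name `sample_function` and the particular hyperparameter values in the usage lines are assumptions chosen for illustration (they follow the pattern $\sigma_{\rm in} = 5\sigma_{\rm bias}$ quoted in the caption of figure 44.2).

```python
import numpy as np

def sample_function(x, H, sigma_bias, sigma_in, sigma_out, rng):
    """Draw one random function y(x) from a one-input network with H tanh hidden units.

    x is a 1-D array of input values; the return value has the same shape.
    """
    theta1 = rng.normal(0.0, sigma_bias, size=H)   # hidden biases
    w1 = rng.normal(0.0, sigma_in, size=H)         # input-to-hidden weights
    w2 = rng.normal(0.0, sigma_out, size=H)        # hidden-to-output weights
    theta2 = rng.normal(0.0, sigma_out)            # output bias
    h = np.tanh(np.outer(x, w1) + theta1)          # shape (len(x), H)
    return h @ w2 + theta2

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 401)
y = sample_function(x, H=400, sigma_bias=4.0, sigma_in=20.0, sigma_out=0.05, rng=rng)
```

Plotting `y` against `x` for several settings of the three standard deviations gives a direct feel for the length scales quoted above: larger `sigma_in` produces finer wiggles, and the vertical scale grows with `sqrt(H) * sigma_out`.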


44.2 How a regression network is traditionally trained 


This network is trained using a data set $D = \{\mathbf{x}^{(n)}, \mathbf{t}^{(n)}\}$ by adjusting w so as 
to minimize an error function, e.g., 

\[ E_D(\mathbf{w}) = \frac{1}{2} \sum_n \sum_i \left( t_i^{(n)} - y_i(\mathbf{x}^{(n)}; \mathbf{w}) \right)^2. \tag{44.3} \]

This objective function is a sum of terms, one for each input/target pair {x, t}, 
measuring how close the output y(x; w) is to the target t. 

This minimization is based on repeated evaluation of the gradient of $E_D$. 
This gradient can be efficiently computed using the backpropagation algorithm 
(Rumelhart et al., 1986), which uses the chain rule to find the derivatives. 
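
As an illustration of how the chain rule yields these derivatives, here is a hedged sketch that evaluates $E_D$ of (44.3) and its gradient for the one-hidden-layer regression network of (44.1) and (44.2), assuming tanh hidden units and linear outputs. The function name `error_and_gradient` and the batch layout (one example per row) are choices made for this example only.

```python
import numpy as np

def error_and_gradient(X, T, W1, theta1, W2, theta2):
    """Sum-squared error E_D and its gradient for a tanh/linear two-layer network.

    X : inputs, shape (N, L);  T : targets, shape (N, O)
    W1: (H, L), theta1: (H,), W2: (O, H), theta2: (O,)
    """
    # Forward pass, equations (44.1) and (44.2).
    A1 = X @ W1.T + theta1
    Hid = np.tanh(A1)
    Y = Hid @ W2.T + theta2

    # Error, equation (44.3).
    R = Y - T                                   # dE_D/dY
    E = 0.5 * np.sum(R ** 2)

    # Backward pass: apply the chain rule layer by layer.
    dW2 = R.T @ Hid                             # dE_D/dW2
    dtheta2 = R.sum(axis=0)                     # dE_D/dtheta2
    dA1 = (R @ W2) * (1.0 - Hid ** 2)           # tanh'(a) = 1 - tanh(a)^2
    dW1 = dA1.T @ X                             # dE_D/dW1
    dtheta1 = dA1.sum(axis=0)                   # dE_D/dtheta1
    return E, (dW1, dtheta1, dW2, dtheta2)

rng = np.random.default_rng(1)
X, T = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
W1, theta1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, theta2 = rng.normal(size=(2, 4)), rng.normal(size=2)
E, grads = error_and_gradient(X, T, W1, theta1, W2, theta2)
```

A useful habit is to compare one component of the returned gradient with a finite-difference estimate of $E_D$ before trusting it in an optimizer.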


Figure 44.3. Properties of a 
function produced by a random 
network. The vertical scale of a 
typical function produced by the 
network with random weights is of 
order $\sqrt{H}\,\sigma_{\rm out}$; the horizontal 
range in which the function varies 
significantly is of order $\sigma_{\rm bias}/\sigma_{\rm in}$; 
and the shortest horizontal length 
scale is of order $1/\sigma_{\rm in}$. The 
function shown was produced by 
making a random network with 
H = 400 hidden units, and 
Gaussian weights with $\sigma_{\rm bias}$ = 4, 
$\sigma_{\rm in}$ = 8, and $\sigma_{\rm out}$ = 0.5. 


Figure 44.4. One sample from the 
prior of a two-input network with 
$\{H, \sigma_{\rm in}, \sigma_{\rm bias}, \sigma_{\rm out}\}$ = 
{400, 8.0, 8.0, 0.05}. 


Often, regularization (also known as weight decay) is included, modifying 
the objective function to: 

\[ M(\mathbf{w}) = \beta E_D + \alpha E_W \tag{44.4} \]

where, for example, $E_W = \frac{1}{2} \sum_i w_i^2$. This additional term favours small values 
of w and decreases the tendency of a model to overfit noise in the training 
data. 
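
The name ‘weight decay’ can be read off from a single gradient-descent step on M(w) with the quadratic regularizer. In the worked step below, the learning rate $\eta$ is a symbol introduced here purely for illustration:

```latex
\[
  \frac{\partial M}{\partial w_i}
    = \beta \frac{\partial E_D}{\partial w_i} + \alpha w_i ,
  \qquad
  w_i \;\leftarrow\; w_i - \eta \frac{\partial M}{\partial w_i}
    = (1 - \eta\alpha)\, w_i - \eta\beta \frac{\partial E_D}{\partial w_i} .
\]
```

The factor $(1 - \eta\alpha)$ shrinks every weight towards zero at each step, independently of the data; the data term then pulls the weights away from zero only where doing so reduces $E_D$.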

Rumelhart et al. (1986) showed that multilayer perceptrons can be trained, 
by gradient descent on M(w), to discover solutions to non-trivial problems 
such as deciding whether an image is symmetric or not. These networks have 
been successfully applied to real-world tasks as varied as pronouncing English 
text (Sejnowski and Rosenberg, 1987) and focussing multiple-mirror telescopes 
(Angel et al., 1990). 


> 44.3 Neural network learning as inference 


The neural network learning process above can be given the following proba- 
bilistic interpretation. [Here we repeat and generalize the discussion of Chap- 
ter 41.] 

The error function is interpreted as defining a noise model. $\beta E_D$ is the 
negative log likelihood: 

\[ P(D \mid \mathbf{w}, \beta, \mathcal{H}) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D). \tag{44.5} \]

Thus, the use of the sum-squared error $E_D$ (44.3) corresponds to an assumption 
of Gaussian noise on the target variables, and the parameter $\beta$ defines a 
noise level $\sigma_\nu^2 = 1/\beta$. 

Similarly the regularizer is interpreted in terms of a log prior probability 
distribution over the parameters: 





\[ P(\mathbf{w} \mid \alpha, \mathcal{H}) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W). \tag{44.6} \]

If $E_W$ is quadratic as defined above, then the corresponding prior distribution 
is a Gaussian with variance $\sigma_W^2 = 1/\alpha$. The probabilistic model $\mathcal{H}$ specifies 
the architecture $\mathcal{A}$ of the network, the likelihood (44.5), and the prior (44.6). 

The objective function M(w) then corresponds to the inference of the 
parameters w, given the data: 


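\[ P(\mathbf{w} \mid D, \alpha, \beta, \mathcal{H}) = \frac{P(D \mid \mathbf{w}, \beta, \mathcal{H}) \, P(\mathbf{w} \mid \alpha, \mathcal{H})}{P(D \mid \alpha, \beta, \mathcal{H})} \tag{44.7} \]

\[ \phantom{P(\mathbf{w} \mid D, \alpha, \beta, \mathcal{H})} = \frac{1}{Z_M} \exp(-M(\mathbf{w})). \tag{44.8} \]

The w found by (locally) minimizing M(w) is then interpreted as the (locally) 
most probable parameter vector, $\mathbf{w}_{\rm MP}$. 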
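
To spell out the step from the likelihood and prior to the objective function, take the negative logarithm of the posterior, using the exponential forms of the likelihood (44.5) and the prior (44.6):

```latex
\[
  -\ln P(\mathbf{w} \mid D, \alpha, \beta, \mathcal{H})
    = \beta E_D(\mathbf{w}) + \alpha E_W(\mathbf{w})
      + \ln\!\left[ Z_D(\beta)\, Z_W(\alpha)\, P(D \mid \alpha, \beta, \mathcal{H}) \right]
    = M(\mathbf{w}) + \ln Z_M .
\]
```

The bracketed quantity does not depend on w, so it plays the role of the normalizing constant $Z_M$, and minimizing M(w) is the same as maximizing the posterior probability of w.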

The interpretation of M(w) as a log probability adds little new at this 
stage. But new tools will emerge when we proceed to other inferences. First, 
though, let us establish the probabilistic interpretation of classification 
networks, to which the same tools apply. 


Binary classification networks 


If the targets t in a data set are binary classification labels (0,1), it is natural 
to use a neural network whose output $y(\mathbf{x}; \mathbf{w}, \mathcal{A})$ is bounded between 0 and 1, 
and is interpreted as a probability $P(t=1 \mid \mathbf{x}, \mathbf{w}, \mathcal{A})$. For example, a network 
with one hidden layer could be described by the feedforward equations (44.1) 
and (44.2), with $f^{(2)}(a) = 1/(1+e^{-a})$. The error function $\beta E_D$ is replaced by 
the negative log likelihood: 

\[ G(\mathbf{w}) = -\left[ \sum_n t^{(n)} \ln y(\mathbf{x}^{(n)}; \mathbf{w}) + (1 - t^{(n)}) \ln\left(1 - y(\mathbf{x}^{(n)}; \mathbf{w})\right) \right]. \tag{44.9} \]

The total objective function is then $M = G + \alpha E_W$. Note that this includes 
no parameter $\beta$ (because there is no Gaussian noise). 
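
A small numerical sketch of (44.9) may help fix ideas. It assumes the network outputs have already been passed through the logistic function; the names `sigmoid` and `G_binary` are illustrative.

```python
import numpy as np

def sigmoid(a):
    """Logistic output nonlinearity f^(2)(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def G_binary(y, t):
    """Negative log likelihood (44.9) for outputs y in (0, 1) and targets t in {0, 1}."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

# Output activations a^(2) for four examples, and their binary targets.
a2 = np.array([2.0, -1.0, 0.5, -3.0])
t = np.array([1, 0, 1, 0])
G = G_binary(sigmoid(a2), t)
```

An uninformative network that outputs y = 0.5 for every example scores ln 2 per example, which is a convenient baseline when reading values of G.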


Multi-class classification networks 


For a multi-class classification problem, we can represent the targets by a 
vector, t, in which a single element is set to 1, indicating the correct class, and 
all other elements are set to 0. In this case it is appropriate to use a ‘softmax’ 
network having coupled outputs which sum to one and are interpreted as 
class probabilities $y_i = P(t_i = 1 \mid \mathbf{x}, \mathbf{w}, \mathcal{A})$. The last part of equation (44.2) is 
replaced by: 

\[ y_i = \frac{e^{a_i}}{\sum_{i'} e^{a_{i'}}}. \tag{44.10} \]

The negative log likelihood in this case is 

\[ G(\mathbf{w}) = -\sum_n \sum_i t_i^{(n)} \ln y_i(\mathbf{x}^{(n)}; \mathbf{w}). \tag{44.11} \]

As in the case of the regression network, the minimization of the objective 
function $M(\mathbf{w}) = G + \alpha E_W$ corresponds to an inference of the form (44.8). A 
variety of useful results can be built on this interpretation. 
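
The softmax outputs (44.10) and the corresponding negative log likelihood (44.11) can be sketched as follows. Subtracting the largest activation before exponentiating is a standard numerical-stability device added here; it is not part of the text's definition and leaves the outputs unchanged. The names `softmax` and `G_multiclass` are illustrative.

```python
import numpy as np

def softmax(a):
    """Softmax over the last axis, equation (44.10)."""
    a = np.asarray(a, dtype=float)
    a = a - a.max(axis=-1, keepdims=True)   # stability shift; cancels in the ratio
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def G_multiclass(Y, T):
    """Negative log likelihood (44.11); T holds one-of-K target vectors, one per row."""
    return -np.sum(np.asarray(T, dtype=float) * np.log(Y))

# Three examples, three classes: rows are output activations a_i.
A = np.array([[2.0, 0.0, -1.0],
              [0.0, 0.0, 0.0],
              [-1.0, 3.0, 0.5]])
T = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 0]])
Y = softmax(A)
G = G_multiclass(Y, T)
```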


> 44.4 Benefits of the Bayesian approach to supervised feedforward 
neural networks 


From the statistical perspective, supervised neural networks are nothing more 
than nonlinear curve-fitting devices. Curve fitting is not a trivial task however. 
The effective complexity of an interpolating model is of crucial importance, 
as illustrated in figure 44.5. Consider a control parameter that influences the 
complexity of a model, for example a regularization constant $\alpha$ (weight decay 
parameter). As the control parameter is varied to increase the complexity of 
the model (descending from figure 44.5a-c and going from left to right across 
figure 44.5d), the best fit to the training data that the model can achieve 
becomes increasingly good. However, the empirical performance of the model, 
the test error, first decreases then increases again. An over-complex model 
overfits the data and generalizes poorly. This problem may also complicate 
the choice of architecture in a multilayer perceptron, the radius of the basis 
functions in a radial basis function network, and the choice of the input vari- 
ables themselves in any multidimensional regression problem. Finding values 
for model control parameters that are appropriate for the data is therefore an 
important and non-trivial problem. 

The overfitting problem can be solved by using a Bayesian approach to 
control model complexity. 

If we give a probabilistic interpretation to the model, then we can evaluate 
the evidence for alternative values of the control parameters. As was explained 
in Chapter 28, over-complex models turn out to be less probable, and the 
evidence P(Data| Control Parameters) can be used as an objective function 
for optimization of model control parameters (figure 44.5e). The setting of $\alpha$ 
that maximizes the evidence is displayed in figure 44.5b. 

Bayesian optimization of model control parameters has four important ad- 
vantages. (1) No ‘test set’ or ‘validation set’ is involved, so all available training 
data can be devoted to both model fitting and model comparison. (2) Reg- 
ularization constants can be optimized on-line, i.e., simultaneously with the 
optimization of ordinary model parameters. (3) The Bayesian objective func- 
tion is not noisy, in contrast to a cross-validation measure. (4) The gradient of 
the evidence with respect to the control parameters can be evaluated, making 
it possible to simultaneously optimize a large number of control parameters. 

Probabilistic modelling also handles uncertainty in a natural manner. It 
offers a unique prescription, marginalization, for incorporating uncertainty 
about parameters into predictions; this procedure yields better predictions, as 
we saw in Chapter 41. Figure 44.6 shows error bars on the predictions of a 
trained neural network. 


Implementation of Bayesian inference 


As was mentioned in Chapter 41, Bayesian inference for multilayer networks 
may be implemented by Monte Carlo sampling, or by deterministic methods 
employing Gaussian approximations (Neal, 1996; MacKay, 1992c). 


Figure 44.5. Optimization of 
model complexity. Panels (a-c) 
show a radial basis function model 
interpolating a simple data set 
with one input variable and one 
output variable. As the 
regularization constant is varied 
to increase the complexity of the 
model (from (a) to (c)), the 
interpolant is able to fit the 
training data increasingly well, 
but beyond a certain point the 
generalization ability (test error) 
of the model deteriorates. 
Probability theory allows us to 
optimize the control parameters 
without needing a test set. 





Figure 44.6. Error bars on the 
predictions of a trained regression 
network. The solid line gives the 
predictions of the best-fit 
parameters of a multilayer 
perceptron trained on the data 
points. The error bars (dotted 
lines) are those produced by the 
uncertainty of the parameters w. 
Notice that the error bars become 
larger where the data are sparse. 

Within the Bayesian framework for data modelling, it is easy to improve 
our probabilistic models. For example, if we believe that some input variables 
in a problem may be irrelevant to the predicted quantity, but we don’t know 
which, we can define a new model with multiple hyperparameters that captures 
the idea of uncertain input variable relevance (MacKay, 1994b; Neal, 1996; 
MacKay, 1995b); these models then infer automatically from the data which 
are the relevant input variables for a problem. 


> 44.5 Exercises 


Exercise 44.1.[4] How to measure a classifier’s quality. You’ve just written a new 
classification algorithm and want to measure how well it performs on a test 
set, and compare it with other classifiers. What performance measure should 
you use? There are several standard answers. Let’s assume the classifier gives 
an output y(x), where x is the input, which we won’t discuss further, and that 
the true target value is t. In the simplest discussions of classifiers, both y and 
t are binary variables, but you might care to consider cases where y and t are 
more general objects also. 

The most widely used measure of performance on a test set is the error 
rate — the fraction of misclassifications made by the classifier. This measure 
forces the classifier to give a 0/1 output and ignores any additional information 
that the classifier might be able to offer — for example, an indication of the 
firmness of a prediction. Unfortunately, the error rate does not necessarily 
measure how informative a classifier’s output is. Consider frequency tables 
showing the joint frequency of the 0/1 output of a classifier (horizontal axis), 
and the true 0/1 variable (vertical axis). The numbers that we’ll show are 
percentages. The error rate e is the sum of the two off-diagonal numbers, 
which we could call the false positive rate e+ and the false negative rate e−. 

Of the following three classifiers, A and B have the same error rate of 10% 
and C has a greater error rate of 12%. 


    Classifier A        Classifier B        Classifier C 
      y | 0    1          y | 0    1          y | 0    1 
    t                   t                   t 
    0   | 90   0        0   | 80  10        0   | 78  12 
    1   | 10   0        1   |  0  10        1   |  0  10 





But clearly classifier A, which simply guesses that the outcome is 0 for all 
cases, is conveying no information at all about t; whereas classifier B has an 
informative output: if y = 0 then we are sure that t really is zero; and if y= 1 
then there is a 50% chance that t=1, as compared to the prior probability 
P(t=1) = 0.1. Classifier C is slightly less informative than B, but it is still 
much more useful than the information-free classifier A. 

How common sense ranks the classifiers: (best) B > C > A (worst). 
How error rate ranks the classifiers: (best) A = B > C (worst). 

One way to improve on the error rate as a performance measure is to report 
the pair (e+, e−), the false positive error rate and the false negative error rate, 
which are (0, 0.1) and (0.1, 0) for classifiers A and B. It is especially important 
to distinguish between these two error probabilities in applications where the 
two sorts of error have different associated costs. However, there are a couple 
of problems with the ‘error rate pair’: 


• First, if I simply told you that classifier A has error rates (0, 0.1) and B 
has error rates (0.1, 0), it would not be immediately evident that classifier 
A is actually utterly worthless. Surely we should have a performance 
measure that gives the worst possible score to A! 

• Second, if we turn to a multiple-class classification problem such as digit 
recognition, then the number of types of error increases from two to 
10 × 9 = 90, one for each possible confusion of class t with t′. It would 
be nice to have some sensible way of collapsing these 90 numbers into a 
single rankable number that makes more sense than the error rate. 


Another reason for not liking the error rate is that it doesn’t give a classifier 
credit for accurately specifying its uncertainty. Consider classifiers that have 
three outputs available, ‘0’, ‘1’ and a rejection class, ‘?’, which indicates that 
the classifier is not sure. Consider classifiers D and E with the following 
frequency tables, in percentages: 


Classifier D Classifier E 





Both of these classifiers have (e+, e−, r) = (6%, 0%, 11%). But are they equally 
good classifiers? Compare classifier E with C. The two classifiers are equivalent. 
E is just C in disguise: we could make E by taking the output of C and 
tossing a coin when C says ‘1’ in order to decide whether to give output ‘1’ or 
‘?’. So E is equal to C and thus inferior to B. Now compare D with B. Can 
you justify the suggestion that D is a more informative classifier than B, and 
thus is superior to E? Yet D and E have the same (e+, e−, r) scores. 

People often plot error-reject curves (also known as ROC curves; ROC 
stands for ‘receiver operating characteristic’) which show the total e = (e+ + 
e−) versus r as r is allowed to vary from 0 to 1, and use these curves to 
compare classifiers (figure 44.7). [In the special case of binary classification 
problems, e+ may be plotted versus e− instead.] But as we have seen, error 
rates can be undiscerning performance measures. Does plotting one error rate 
as a function of another make this weakness of error rates go away? 

For this exercise, either construct an explicit example demonstrating that 
the error-reject curve, and the area under it, are not necessarily good ways to 
compare classifiers; or prove that they are. 

Figure 44.7. An error-reject curve (error rate plotted against rejection rate). 
Some people use the area under this curve as a measure of classifier quality. 

As a suggested alternative method for comparing classifiers, consider the 
mutual information between the output and the target, 


\[ I(T; Y) \equiv H(T) - H(T \mid Y) = \sum_{y, t} P(y) P(t \mid y) \log \frac{P(t \mid y)}{P(t)}, \tag{44.12} \]


which measures how many bits the classifier’s output conveys about the target. 
Evaluate the mutual information for classifiers A-E above. 
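
As a starting point for this evaluation, the mutual information can be computed directly from a joint frequency table. The sketch below is hedged in two ways: the function name `mutual_information` is an illustrative choice, and the tables for B, C and E are not quoted from the text but worked out from its description (P(t=1) = 0.1, the stated error rates, the 50% statement about B, and the coin-toss construction of E from C). Classifier D is omitted because its table cannot be recovered from the description alone.

```python
import numpy as np

def mutual_information(joint):
    """I(T;Y) in bits from a joint frequency table with rows t and columns y."""
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()                          # normalize percentages to probabilities
    p_t = p.sum(axis=1, keepdims=True)       # marginal P(t), shape (rows, 1)
    p_y = p.sum(axis=0, keepdims=True)       # marginal P(y), shape (1, cols)
    mask = p > 0                             # 0 log 0 is taken to be 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (p_t @ p_y)[mask])))

# Rows: t = 0, 1.  Columns: y = 0, 1 (and y = 0, ?, 1 for classifier E).
tables = {
    "A": [[90, 0], [10, 0]],
    "B": [[80, 10], [0, 10]],        # assumed: derived from the description of B
    "C": [[78, 12], [0, 10]],        # assumed: derived from C's 12% error rate and E
    "E": [[78, 6, 6], [0, 5, 5]],    # assumed: C with a coin toss when C says '1'
}
scores = {name: mutual_information(table) for name, table in tables.items()}
for name, bits in scores.items():
    print(f"classifier {name}: {bits:.3f} bits")
```

Under these assumed tables the ordering produced agrees with the common-sense ranking B > C > A, and E scores the same as C, as the coin-toss argument suggests it should.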
Investigate this performance measure and discuss whether it is a useful 
one. Does it have practical drawbacks? 


About Chapter 45 


Feedforward neural networks such as multilayer perceptrons are popular tools 
for nonlinear regression and classification problems. From a Bayesian per- 
spective, a choice of a neural network model can be viewed as defining a prior 
probability distribution over nonlinear functions, and the neural network’s 
learning process can be interpreted in terms of the posterior probability dis- 
tribution over the unknown function. (Some learning algorithms search for the 
function with maximum posterior probability and other Monte Carlo methods 
draw samples from this posterior probability.) 

In the limit of large but otherwise standard networks, Neal (1996) has 
shown that the prior distribution over nonlinear functions implied by the 
Bayesian neural network falls in a class of probability distributions known 
as Gaussian processes. The hyperparameters of the neural network model 
determine the characteristic lengthscales of the Gaussian process. Neal’s ob- 
servation motivates the idea of discarding parameterized networks and working 
directly with Gaussian processes. Computations in which the parameters of 
the network are optimized are then replaced by simple matrix operations using 
the covariance matrix of the Gaussian process. 

In this chapter I will review work on this idea by Williams and Rasmussen 
(1996), Neal (1997b), Barber and Williams (1997) and Gibbs and MacKay 
(2000), and will assess whether, for supervised regression and classification 
tasks, the feedforward network has been superseded. 


> Exercise 45.1.[3] I regret that this chapter is rather dry. There are no simple 
explanatory examples in it, and few pictures. This exercise asks you to 
create interesting pictures to explain to yourself this chapter’s ideas. 


Source code for computer demonstrations written in the free language 
octave is available at: 
http://www.inference.phy.cam.ac.uk/mackay/itprnn/software.html. 

Radford Neal’s software for Gaussian processes is available at: 
http://www.cs.toronto.edu/~radford/. 


45 


Gaussian Processes 


After the publication of Rumelhart, Hinton and Williams’s (1986) paper on 
supervised learning in neural networks there was a surge of interest in the 
empirical modelling of relationships in high-dimensional data using nonlinear 
parametric models such as multilayer perceptrons and radial basis functions. 
In the Bayesian interpretation of these modelling methods, a nonlinear function 
y(x) parameterized by parameters w is assumed to underlie the data 
$\{\mathbf{x}^{(n)}, t_n\}_{n=1}^N$, and the adaptation of the model to the data corresponds to an 
inference of the function given the data. We will denote the set of input vectors 
by $\mathbf{X}_N \equiv \{\mathbf{x}^{(n)}\}_{n=1}^N$ and the set of corresponding target values by the vector 
$\mathbf{t}_N \equiv \{t_n\}_{n=1}^N$. The inference of y(x) is described by the posterior probability 
distribution 

\[ P(y(\mathbf{x}) \mid \mathbf{t}_N, \mathbf{X}_N) = \frac{P(\mathbf{t}_N \mid y(\mathbf{x}), \mathbf{X}_N) \, P(y(\mathbf{x}))}{P(\mathbf{t}_N \mid \mathbf{X}_N)}. \tag{45.1} \]

Of the two terms on the right-hand side, the first, $P(\mathbf{t}_N \mid y(\mathbf{x}), \mathbf{X}_N)$, is the 
probability of the target values given the function y(x), which in the case of 
regression problems is often assumed to be a separable Gaussian distribution; 
and the second term, P(y(x)), is the prior distribution on functions assumed 
by the model. This prior is implicit in the choice of parametric model and 
the choice of regularizers used during the model fitting. The prior typically 
specifies that the function y(x) is expected to be continuous and smooth, 
and has less high frequency power than low frequency power, but the precise 
meaning of the prior is somewhat obscured by the use of the parametric model. 

Now, for the prediction of future values of t, all that matters is the assumed 
prior P(y(x)) and the assumed noise model $P(\mathbf{t}_N \mid y(\mathbf{x}), \mathbf{X}_N)$; the 
parameterization of the function y(x; w) is irrelevant. 

The idea of Gaussian process modelling is to place a prior P(y(x)) directly 
on the space of functions, without parameterizing y(x). The simplest type of 
prior over functions is called a Gaussian process. It can be thought of as the 
generalization of a Gaussian distribution over a finite vector space to a function 
space of infinite dimension. Just as a Gaussian distribution is fully specified 
by its mean and covariance matrix, a Gaussian process is specified by a mean 
and a covariance function. Here, the mean is a function of x (which we will 
often take to be the zero function), and the covariance is a function C(x, x’) 
that expresses the expected covariance between the values of the function y 
at the points x and x’. The function y(x) in any one data modelling problem 
is assumed to be a single sample from this Gaussian distribution. Gaussian 
processes are already well established models for various spatial and temporal 
problems — for example, Brownian motion, Langevin processes and Wiener 
processes are all examples of Gaussian processes; Kalman filters, widely used 
to model speech waveforms, also correspond to Gaussian process models; the 
method of ‘kriging’ in geostatistics is a Gaussian process regression method. 


Reservations about Gaussian processes 


It might be thought that it is not possible to reproduce the interesting prop- 
erties of neural network interpolation methods with something so simple as a 
Gaussian distribution, but as we shall now see, many popular nonlinear inter- 
polation methods are equivalent to particular Gaussian processes. (I use the 
term ‘interpolation’ to cover both the problem of ‘regression’ — fitting a curve 
through noisy data — and the task of fitting an interpolant that passes exactly 
through the given data points.) 

It might also be thought that the computational complexity of inference 
when we work with priors over infinite-dimensional function spaces might be 
infinitely large. But by concentrating on the joint probability distribution of 
the observed data and the quantities we wish to predict, it is possible to make 
predictions with resources that scale as polynomial functions of N, the number 
of data points. 


> 45.1 Standard methods for nonlinear regression 


The problem 


We are given N data points $\mathbf{X}_N, \mathbf{t}_N = \{\mathbf{x}^{(n)}, t_n\}_{n=1}^N$. The inputs x are vectors 
of some fixed input dimension I. The targets t are either real numbers, 
in which case the task will be a regression or interpolation task, or they are 
categorical variables, for example $t \in \{0, 1\}$, in which case the task is a 
classification task. We will concentrate on the case of regression for the time 
being. 

Assuming that a function y(x) underlies the observed data, the task is to 
infer the function from the given data, and predict the function’s value (or 
the value of the observation $t_{N+1}$) at a new point $\mathbf{x}^{(N+1)}$. 


Parametric approaches to the problem 


In a parametric approach to regression we express the unknown function y(x) 
in terms of a nonlinear function y(x; w) parameterized by parameters w. 


Example 45.2. Fixed basis functions. Using a set of basis functions $\{\phi_h(\mathbf{x})\}_{h=1}^H$, 
we can write 

\[ y(\mathbf{x}; \mathbf{w}) = \sum_{h=1}^{H} w_h \phi_h(\mathbf{x}). \tag{45.2} \]

If the basis functions are nonlinear functions of x such as radial basis 
functions centred at fixed points $\{\mathbf{c}_h\}_{h=1}^H$, 

\[ \phi_h(\mathbf{x}) = \exp\left[ -\frac{(\mathbf{x} - \mathbf{c}_h)^2}{2 r^2} \right], \tag{45.3} \]

then y(x; w) is a nonlinear function of x; however, since the dependence 
of y on the parameters w is linear, we might sometimes refer to this as 
a ‘linear’ model. In neural network terms, this model is like a multilayer 
network whose connections from the input layer to the nonlinear hidden 
layer are fixed; only the output weights w are adaptive. 

Other possible sets of fixed basis functions include polynomials such as 
$\phi_h(\mathbf{x}) = x_i^p x_j^q$ where p and q are integer powers that depend on h. 

Example 45.3. Adaptive basis functions. Alternatively, we might make a function 
y(x) from basis functions that depend on additional parameters 
included in the vector w. In a two-layer feedforward neural network 
with nonlinear hidden units and a linear output, the function can be 
written 

\[ y(\mathbf{x}; \mathbf{w}) = \sum_{h=1}^{H} w_h^{(2)} \tanh\left( \sum_{i=1}^{I} w_{hi}^{(1)} x_i + w_{h0}^{(1)} \right) + w_0^{(2)} \tag{45.4} \]

where I is the dimensionality of the input space and the weight vector 
w consists of the input weights $\{w_{hi}^{(1)}\}$, the hidden unit biases $\{w_{h0}^{(1)}\}$, 
the output weights $\{w_h^{(2)}\}$ and the output bias $w_0^{(2)}$. In this model, the 
dependence of y on w is nonlinear. 
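
The sense in which the fixed-basis model of Example 45.2 is ‘linear’ can be seen in a short sketch: once the radial basis functions are evaluated, fitting w is an ordinary linear least-squares problem. The function names, the choice of nine centres, the width r = 0.3 and the synthetic sine data are all assumptions made for this illustration, and plain least squares is used here rather than any Bayesian treatment.

```python
import numpy as np

def rbf_features(x, centres, r):
    """Matrix of basis-function values phi_h(x_n) from (45.3), for scalar inputs."""
    x = np.asarray(x, dtype=float)
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * r ** 2))

def predict(x, w, centres, r):
    """y(x; w) = sum_h w_h phi_h(x), equation (45.2)."""
    return rbf_features(x, centres, r) @ w

rng = np.random.default_rng(0)
centres = np.linspace(-1.0, 1.0, 9)      # fixed centres c_h
r = 0.3                                  # common width of the basis functions

# Synthetic one-dimensional data (purely illustrative).
x_train = np.linspace(-1.0, 1.0, 30)
t_train = np.sin(3.0 * x_train) + 0.05 * rng.normal(size=x_train.size)

# Because y is linear in w, the least-squares weights solve a linear system.
Phi = rbf_features(x_train, centres, r)
w_fit, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)
y_fit = predict(x_train, w_fit, centres, r)
```

For the adaptive model of Example 45.3 no such linear solve exists, because the hidden-unit weights sit inside the tanh; that is the practical meaning of the nonlinear dependence of y on w.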


Having chosen the parameterization, we then infer the function y(x; w) by 
inferring the parameters w. The posterior probability of the parameters is 


\[ P(\mathbf{w} \mid \mathbf{t}_N, \mathbf{X}_N) = \frac{P(\mathbf{t}_N \mid \mathbf{w}, \mathbf{X}_N) \, P(\mathbf{w})}{P(\mathbf{t}_N \mid \mathbf{X}_N)}. \tag{45.5} \]

The factor $P(\mathbf{t}_N \mid \mathbf{w}, \mathbf{X}_N)$ states the probability of the observed data points 
when the parameters w (and hence, the function y) are known. This probability 
distribution is often taken to be a separable Gaussian, each data point 
$t_n$ differing from the underlying value $y(\mathbf{x}^{(n)}; \mathbf{w})$ by additive noise. The factor 
P(w) specifies the prior probability distribution of the parameters. This too 
is often taken to be a separable Gaussian distribution. If the dependence of y 
on w is nonlinear the posterior distribution $P(\mathbf{w} \mid \mathbf{t}_N, \mathbf{X}_N)$ is in general not a 
Gaussian distribution. 

The inference can be implemented in various ways. In the Laplace method, 
we minimize an objective function 


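\[ M(\mathbf{w}) = -\ln\left[ P(\mathbf{t}_N \mid \mathbf{w}, \mathbf{X}_N) \, P(\mathbf{w}) \right] \tag{45.6} \]

with respect to w, locating the locally most probable parameters, then use the 
curvature of M, $\partial^2 M(\mathbf{w}) / \partial w_i \partial w_j$, to define error bars on w. Alternatively we 
can use more general Markov chain Monte Carlo techniques to create samples 
from the posterior distribution $P(\mathbf{w} \mid \mathbf{t}_N, \mathbf{X}_N)$. 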

Having obtained one of these representations of the inference of w given 
the data, predictions are then made by marginalizing over the parameters: 


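\[ P(t_{N+1} \mid \mathbf{t}_N, \mathbf{X}_{N+1}) = \int \mathrm{d}^H \mathbf{w} \; P(t_{N+1} \mid \mathbf{w}, \mathbf{x}^{(N+1)}) \, P(\mathbf{w} \mid \mathbf{t}_N, \mathbf{X}_N). \tag{45.7} \]

If we have a Gaussian representation of the posterior $P(\mathbf{w} \mid \mathbf{t}_N, \mathbf{X}_N)$, then this 
integral can typically be evaluated directly. In the alternative Monte Carlo 
approach, which generates R samples $\mathbf{w}^{(r)}$ that are intended to be samples 
from the posterior distribution $P(\mathbf{w} \mid \mathbf{t}_N, \mathbf{X}_N)$, we approximate the predictive 
distribution by 

\[ P(t_{N+1} \mid \mathbf{t}_N, \mathbf{X}_{N+1}) \simeq \frac{1}{R} \sum_{r=1}^{R} P(t_{N+1} \mid \mathbf{w}^{(r)}, \mathbf{x}^{(N+1)}). \tag{45.8} \]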
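
The Monte Carlo average (45.8) is short to write down in code. The sketch below assumes Gaussian observation noise of standard deviation `sigma_nu` and takes the posterior samples as given; here they are simply drawn from a made-up Gaussian to stand in for the output of a real sampler. The names `gaussian_pdf`, `mc_predictive` and the toy linear model `y` are all hypothetical.

```python
import numpy as np

def gaussian_pdf(t, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    return np.exp(-0.5 * ((t - mean) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def mc_predictive(t_values, x_new, w_samples, y, sigma_nu):
    """Approximate P(t_{N+1} | data) at each value in t_values, as in (45.8)."""
    means = np.array([y(x_new, w) for w in w_samples])        # y(x_new; w^(r)), shape (R,)
    dens = gaussian_pdf(t_values[:, None], means[None, :], sigma_nu)
    return dens.mean(axis=1)                                  # average over the R samples

# Toy model (an assumption for illustration): y(x; w) = w_0 + w_1 x.
def y(x, w):
    return w[0] + w[1] * x

rng = np.random.default_rng(0)
# Stand-in for R = 200 posterior samples of w; a real application would obtain
# these from a Markov chain Monte Carlo run.
w_samples = rng.normal(loc=[1.0, 2.0], scale=0.1, size=(200, 2))

t_grid = np.linspace(-5.0, 10.0, 3001)
density = mc_predictive(t_grid, x_new=1.5, w_samples=w_samples, y=y, sigma_nu=0.3)
```

The resulting predictive density is a mixture of R Gaussians, so it is broader than the noise distribution alone; the extra width reflects the uncertainty in w.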


Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www. inference.phy.cam.ac.uk/mackay/itila/ for links. 




Nonparametric approaches. 


In nonparametric methods, predictions are obtained without explicitly
parameterizing the unknown function y(x); y(x) lives in the infinite-dimensional
space of all continuous functions of x. One well known nonparametric
approach to the regression problem is the spline smoothing method (Kimeldorf
and Wahba, 1970). A spline solution to a one-dimensional regression problem
can be described as follows: we define the estimator of y(x) to be the function
\hat{y}(x) that minimizes the functional


M(y(x)) = \frac{1}{2} \beta \sum_{n=1}^{N} (y(x^{(n)}) - t_n)^2 + \frac{1}{2} \alpha \int dx \, [y^{(p)}(x)]^2, (45.9)


where y^{(p)} is the pth derivative of y and p is a positive number. If p is set to
2 then the resulting function \hat{y}(x) is a cubic spline, that is, a piecewise cubic
function that has ‘knots’ (discontinuities in its second derivative) at the data
points {x^{(n)}}.

This estimation method can be interpreted as a Bayesian method by iden- 
tifying the prior for the function y(x) as: 


\ln P(y(x) | \alpha) = -\frac{1}{2} \alpha \int dx \, [y^{(p)}(x)]^2 + const, (45.10)


and the probability of the data measurements t_N = {t_n}_{n=1}^{N} assuming
independent Gaussian noise as:


\ln P(t_N | y(x), \beta) = -\frac{1}{2} \beta \sum_{n=1}^{N} (y(x^{(n)}) - t_n)^2 + const. (45.11)


[The constants in equations (45.10) and (45.11) are functions of \alpha and \beta
respectively. Strictly the prior (45.10) is improper since addition of an arbitrary
polynomial of degree (p - 1) to y(x) is not constrained. This impropriety is
easily rectified by the addition of (p - 1) appropriate terms to (45.10).] Given
this interpretation of the functions in equation (45.9), M(y(x)) is equal to
minus the log of the posterior probability P(y(x) | t_N, \alpha, \beta), within an additive
constant, and the splines estimation procedure can be interpreted as yielding
a Bayesian MAP estimate. The Bayesian perspective allows us additionally
to put error bars on the splines estimate and to draw typical samples from
the posterior distribution, and it gives an automatic method for inferring the
hyperparameters \alpha and \beta.


Comments 
Splines priors are Gaussian processes 


The prior distribution defined in equation (45.10) is our first example of a 
Gaussian process. Throwing mathematical precision to the winds, a Gaussian 
process can be defined as a probability distribution on a space of functions 
y(x) that can be written in the form 


P(y(x) | \mu(x), A) = \frac{1}{Z} \exp [ -\frac{1}{2} (y(x) - \mu(x))^T A (y(x) - \mu(x)) ], (45.12)


where \mu(x) is the mean function and A is a linear operator, and where the inner
product of two functions y(x)^T z(x) is defined by, for example, \int dx \, y(x) z(x).

Here, if we denote by D the linear operator that maps y(x) to the derivative
of y(x), we can write equation (45.10) as


\ln P(y(x) | \alpha) = -\frac{1}{2} \alpha \int dx \, [D^p y(x)]^2 + const = -\frac{1}{2} y(x)^T A y(x) + const, (45.13)

which has the same form as equation (45.12) with \mu(x) = 0, and A = \alpha [D^p]^T D^p.

In order for the prior in equation (45.12) to be a proper prior, A must be a
positive definite operator, i.e., one satisfying y(x)^T A y(x) > 0 for all functions
y(x) other than y(x) = 0.


Splines can be written as parametric models 


Splines may be written in terms of an infinite set of fixed basis functions, as in
equation (45.2), as follows. First rescale the x axis so that the interval (0, 2\pi)
is much wider than the range of x values of interest. Let the basis functions
be a Fourier set {\cos hx, \sin hx, h = 0, 1, 2, ...}, so the function is


y(x) = \sum_{h=0}^{\infty} w_{h(\cos)} \cos(hx) + \sum_{h=1}^{\infty} w_{h(\sin)} \sin(hx). (45.14)

Use the regularizer

E_W(w) = \sum_{h=0}^{\infty} \frac{1}{2} h^{p/2} w_{h(\cos)}^2 + \sum_{h=1}^{\infty} \frac{1}{2} h^{p/2} w_{h(\sin)}^2 (45.15)

to define a Gaussian prior on w,

P(w | \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W). (45.16)


If p = 2 then we have the cubic splines regularizer E_W(w) = \int y^{(2)}(x)^2 dx, as
in equation (45.9); if p = 1 we have the regularizer E_W(w) = \int y^{(1)}(x)^2 dx,
etc. (To make the prior proper we must add an extra regularizer on the
term w_{0(\cos)}.) Thus in terms of the prior P(y(x)) there is no fundamental
difference between the ‘nonparametric’ splines approach and other parametric
approaches.


Representation is irrelevant for prediction 


From the point of view of prediction at least, there are two objects of
interest. The first is the conditional distribution P(t_{N+1} | t_N, X_{N+1}) defined in
equation (45.7). The other object of interest, should we wish to compare one
model with others, is the joint probability of all the observed data given the
model, the evidence P(t_N | X_N), which appeared as the normalizing constant
in equation (45.5). Neither of these quantities makes any reference to the
representation of the unknown function y(x). So at the end of the day, our choice
of representation is irrelevant.

The question we now address is, in the case of popular parametric models,
what form do these two quantities take? We will see that for standard models
with fixed basis functions and Gaussian distributions on the unknown
parameters, the joint probability of all the observed data given the model, P(t_N | X_N),
is a multivariate Gaussian distribution with mean zero and with a covariance
matrix determined by the basis functions; this implies that the conditional
distribution P(t_{N+1} | t_N, X_{N+1}) is also a Gaussian distribution, whose mean
depends linearly on the values of the targets t_N. Standard parametric models
are simple examples of Gaussian processes.




> 45.2 From parametric models to Gaussian processes 


Linear models 


Let us consider a regression problem using H fixed basis functions, for example 
one-dimensional radial basis functions as defined in equation (45.3). 

Let us assume that a list of N input points {x^{(n)}} has been specified and
define the N x H matrix R to be the matrix of values of the basis functions
{\phi_h(x)}_{h=1}^{H} at the points {x^{(n)}},

R_{nh} = \phi_h(x^{(n)}). (45.17)


We define the vector y_N to be the vector of values of y(x) at the N points,


y_n = \sum_h R_{nh} w_h. (45.18)


If the prior distribution of w is Gaussian with zero mean,

P(w) = Normal(w; 0, \sigma_w^2 I), (45.19)


then y, being a linear function of w, is also Gaussian distributed, with mean
zero. The covariance matrix of y is


Q = \langle y y^T \rangle = \langle R w w^T R^T \rangle = R \langle w w^T \rangle R^T (45.20)
  = \sigma_w^2 R R^T. (45.21)


So the prior distribution of y is:

P(y) = Normal(y; 0, Q) = Normal(y; 0, \sigma_w^2 R R^T). (45.22)


This result, that the vector of N function values y has a Gaussian
distribution, is true for any selected points X_N. This is the defining property of a
Gaussian process. The probability distribution of a function y(x) is a
Gaussian process if for any finite selection of points x^{(1)}, x^{(2)}, ..., x^{(N)}, the density
P(y(x^{(1)}), y(x^{(2)}), ..., y(x^{(N)})) is a Gaussian.

Now, if the number of basis functions H is smaller than the number of 
data points N, then the matrix Q will not have full rank. In this case the 
probability distribution of y might be thought of as a flat elliptical pancake 
confined to an H-dimensional subspace in the N-dimensional space in which 
y lives. 

What about the target values? If each target t_n is assumed to differ by
additive Gaussian noise of variance \sigma_\nu^2 from the corresponding function value
y_n then t also has a Gaussian prior distribution,


P(t) = Normal(t; 0, Q + \sigma_\nu^2 I). (45.23)

We will denote the covariance matrix of t by C:

C = Q + \sigma_\nu^2 I = \sigma_w^2 R R^T + \sigma_\nu^2 I. (45.24)


Whether or not Q has full rank, the covariance matrix C has full rank since
\sigma_\nu^2 I is full rank.

What does the covariance matrix Q look like? In general, the (n, n') entry
of Q is

Q_{nn'} = [\sigma_w^2 R R^T]_{nn'} = \sigma_w^2 \sum_h \phi_h(x^{(n)}) \phi_h(x^{(n')}) (45.25)

and the (n, n') entry of C is


C_{nn'} = \sigma_w^2 \sum_h \phi_h(x^{(n)}) \phi_h(x^{(n')}) + \delta_{nn'} \sigma_\nu^2, (45.26)


where \delta_{nn'} = 1 if n = n' and 0 otherwise.
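The construction of Q and C from a set of basis functions can be checked numerically. The following is a minimal sketch, assuming radial basis functions of the form exp(-(x - c_h)^2 / (2 r^2)) with hypothetical centres, width, and noise levels; the helper name `rbf_basis` is an invention for this example. It builds Q as in (45.21) and C as in (45.24), confirms that Q is rank-deficient when H < N while C is not, and compares Q with an empirical average of y y^T over draws of w from the prior.

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_basis(x, centres, r):
    """phi_h(x) = exp(-(x - c_h)^2 / (2 r^2)); returns the N x H matrix R."""
    return np.exp(-0.5 * (x[:, None] - centres[None, :]) ** 2 / r**2)

x = np.linspace(0.0, 5.0, 6)          # N = 6 input points
centres = np.linspace(0.0, 5.0, 3)    # H = 3 basis functions, so H < N
sigma_w, sigma_nu, r = 1.0, 0.1, 1.0

R = rbf_basis(x, centres, r)
Q = sigma_w**2 * R @ R.T                  # equation (45.21)
C = Q + sigma_nu**2 * np.eye(len(x))      # equation (45.24)

# Empirical check of (45.20): sample w from the prior and average y y^T.
W = rng.normal(0.0, sigma_w, size=(200000, len(centres)))
Y = W @ R.T                               # each row is one sampled vector y
Q_emp = Y.T @ Y / len(W)

# Q has rank H (the 'pancake'), whereas C has full rank N.
print(np.linalg.matrix_rank(Q, tol=1e-10), np.linalg.matrix_rank(C, tol=1e-10))
print(np.abs(Q_emp - Q).max())
```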


Example 45.4. Let’s take as an example a one-dimensional case, with radial
basis functions. The expression for Q_{nn'} becomes simplest if we assume we
have uniformly-spaced basis functions with the basis function labelled h
centred on the point x = h, and take the limit H \to \infty, so that the sum over
h becomes an integral; to avoid having a covariance that diverges with H,
we had better make \sigma_w^2 scale as S/(\Delta H), where \Delta H is the number of basis
functions per unit length of the x-axis, and S is a constant; then


Q_{nn'} = S \int_{h_{min}}^{h_{max}} dh \, \phi_h(x^{(n)}) \phi_h(x^{(n')}) (45.27)
        = S \int_{h_{min}}^{h_{max}} dh \, \exp [ -\frac{(x^{(n)} - h)^2}{2 r^2} ] \exp [ -\frac{(x^{(n')} - h)^2}{2 r^2} ]. (45.28)


If we let the limits of integration be \pm\infty, we can solve this integral:


Q_{nn'} = \sqrt{\pi r^2} \, S \exp [ -\frac{(x^{(n')} - x^{(n)})^2}{4 r^2} ]. (45.29)


We are arriving at a new perspective on the interpolation problem. Instead of
specifying the prior distribution on functions in terms of basis functions and
priors on parameters, the prior can be summarized simply by a covariance
function,


C(x^{(n)}, x^{(n')}) = \theta_1 \exp [ -\frac{(x^{(n')} - x^{(n)})^2}{4 r^2} ], (45.30)


where we have given a new name, \theta_1, to the constant out front.

Generalizing from this particular case, a vista of interpolation methods
opens up. Given any valid covariance function C(x, x') (we’ll discuss in
a moment what ‘valid’ means) we can define the covariance matrix for N
function values at locations X_N to be the matrix Q given by


Q_{nn'} = C(x^{(n)}, x^{(n')}) (45.31)


and the covariance matrix for N corresponding target values, assuming
Gaussian noise, to be the matrix C given by


C_{nn'} = C(x^{(n)}, x^{(n')}) + \sigma_\nu^2 \delta_{nn'}. (45.32)

In conclusion, the prior probability of the N target values t in the data set is:

P(t) = Normal(t; 0, C) = \frac{1}{Z} \exp ( -\frac{1}{2} t^T C^{-1} t ). (45.33)
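Drawing a sample from the prior (45.33) needs only the covariance matrix. Here is a minimal sketch, assuming the Gaussian covariance function of equation (45.30) with illustrative values of \theta_1, r, and \sigma_\nu; the function name `gaussian_cov` is hypothetical. Samples are generated by multiplying unit normal vectors by a Cholesky factor of C.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_cov(xa, xb, theta1=1.0, r=1.0):
    """Covariance function (45.30): theta1 * exp(-(xa - xb)^2 / (4 r^2))."""
    return theta1 * np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (4.0 * r**2))

x = np.linspace(-3.0, 3.0, 50)
sigma_nu = 0.1
Q = gaussian_cov(x, x)                       # equation (45.31)
C = Q + sigma_nu**2 * np.eye(len(x))         # equation (45.32)

# If C = L L^T and z is a unit normal vector, then L z ~ Normal(0, C).
L = np.linalg.cholesky(C)
samples = L @ rng.standard_normal((len(x), 3))   # three sample vectors t, as columns
print(samples.shape)
```

Plotting each column of `samples` against `x` gives noisy curves whose smoothness is set by r.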


Samples from this Gaussian process and a few other simple Gaussian processes
are displayed in figure 45.1.

[Figure 45.1. Samples drawn from Gaussian process priors.]


Multilayer neural networks and Gaussian processes


Figures 44.2 and 44.3 show some random samples from the prior distribution 
over functions defined by a selection of standard multilayer perceptrons with 
large numbers of hidden units. Those samples don’t seem a million miles away 
from the Gaussian process samples of figure 45.1. And indeed Neal (1996) 
showed that the properties of a neural network with one hidden layer (as 
in equation (45.4)) converge to those of a Gaussian process as the number of 
hidden neurons tends to infinity, if standard ‘weight decay’ priors are assumed. 
The covariance function of this Gaussian process depends on the details of the 
priors assumed for the weights in the network and the activation functions of 
the hidden units. 


> 45.3 Using a given Gaussian process model in regression


We have spent some time talking about priors. We now return to our data 
and the problem of prediction. How do we make predictions with a Gaussian 
process? 

Having formed the covariance matrix C defined in equation (45.32) our task
is to infer t_{N+1} given the observed vector t_N. The joint density P(t_{N+1}, t_N)
is a Gaussian; so the conditional distribution


P(t_{N+1} | t_N) = \frac{P(t_{N+1}, t_N)}{P(t_N)} (45.34)


is also a Gaussian. We now distinguish between different sizes of covariance
matrix C with a subscript, such that C_{N+1} is the (N+1) x (N+1) covariance
matrix for the vector \mathbf{t}_{N+1} = (t_1, ..., t_{N+1})^T. We define submatrices of C_{N+1}
as follows:

C_{N+1} = \begin{bmatrix} C_N & k \\ k^T & \kappa \end{bmatrix}. (45.35)


The posterior distribution (45.34) is given by


P(t_{N+1} | t_N) \propto \exp [ -\frac{1}{2} \begin{bmatrix} t_N & t_{N+1} \end{bmatrix} C_{N+1}^{-1} \begin{bmatrix} t_N \\ t_{N+1} \end{bmatrix} ]. (45.36)
We can evaluate the mean and standard deviation of the posterior distribution
of t_{N+1} by brute-force inversion of C_{N+1}. There is a more elegant expression
for the predictive distribution, however, which is useful whenever predictions
are to be made at a number of new points on the basis of the data set of size
N. We can write C_{N+1}^{-1} in terms of C_N and C_N^{-1} using the partitioned inverse
equations (Barnett, 1979):


C_{N+1}^{-1} = \begin{bmatrix} M & \mathbf{m} \\ \mathbf{m}^T & m \end{bmatrix} (45.37)

where

m = (\kappa - k^T C_N^{-1} k)^{-1} (45.38)
\mathbf{m} = -m C_N^{-1} k (45.39)
M = C_N^{-1} + \frac{1}{m} \mathbf{m} \mathbf{m}^T. (45.40)


When we substitute this matrix into equation (45.36) we find


P(t_{N+1} | t_N) = \frac{1}{Z} \exp [ -\frac{(t_{N+1} - \hat{t}_{N+1})^2}{2 \sigma_{\hat{t}_{N+1}}^2} ] (45.41)

where

\hat{t}_{N+1} = k^T C_N^{-1} t_N (45.42)
\sigma_{\hat{t}_{N+1}}^2 = \kappa - k^T C_N^{-1} k. (45.43)


The predictive mean at the new point is given by \hat{t}_{N+1} and \sigma_{\hat{t}_{N+1}} defines the
error bars on this prediction. Notice that we do not need to invert C_{N+1} in
order to make predictions at x^{(N+1)}. Only C_N needs to be inverted. Thus
Gaussian processes allow one to implement a model with a number of basis
functions H much larger than the number of data points N, with the
computational requirement being of order N^3, independent of H. [We’ll discuss
ways of reducing this cost later.]
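Equations (45.42) and (45.43) translate almost directly into code. The sketch below is one possible implementation, not a definitive one: it assumes the Gaussian covariance function (45.30) with illustrative hyperparameters, uses a Cholesky factorization of C_N in place of an explicit inverse (a choice made here for numerical stability), and the names `gaussian_cov` and `gp_predict` are hypothetical. The result at one new point is then compared with the brute-force route through the inverse of C_{N+1} in equation (45.36).

```python
import numpy as np

def gaussian_cov(xa, xb, theta1=1.0, r=1.0):
    """Covariance function (45.30)."""
    return theta1 * np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (4.0 * r**2))

def gp_predict(x_train, t_train, x_new, sigma_nu=0.1):
    """Predictive mean (45.42) and variance (45.43) at each point of x_new."""
    C_N = gaussian_cov(x_train, x_train) + sigma_nu**2 * np.eye(len(x_train))
    k = gaussian_cov(x_train, x_new)                  # N x M, one column per new point
    kappa = gaussian_cov(x_new, x_new).diagonal() + sigma_nu**2
    L = np.linalg.cholesky(C_N)                       # the only O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, t_train))   # C_N^{-1} t_N
    v = np.linalg.solve(L, k)                         # L^{-1} k
    mean = k.T @ alpha                                # k^T C_N^{-1} t_N
    var = kappa - np.sum(v * v, axis=0)               # kappa - k^T C_N^{-1} k
    return mean, var

x_train = np.array([-2.0, -1.0, 0.0, 1.5])
t_train = np.sin(x_train)
x_new = np.array([0.5, 4.0])
mean, var = gp_predict(x_train, t_train, x_new)

# Brute-force check at the first new point: invert C_{N+1} and condition on t_N.
x_all = np.append(x_train, x_new[0])
C_N1 = gaussian_cov(x_all, x_all) + 0.1**2 * np.eye(len(x_all))
P = np.linalg.inv(C_N1)
var_bf = 1.0 / P[-1, -1]
mean_bf = -var_bf * P[-1, :-1] @ t_train
print(mean, var, mean_bf, var_bf)
```

The factorization of C_N is computed once and reused for every new point, which is the saving the text describes; the error bar at x = 4.0, far from the data, comes out larger than the one at x = 0.5.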

The predictions produced by a Gaussian process depend entirely on the 
covariance matrix C. We now discuss the sorts of covariance functions one 
might choose to define C, and how we can automate the selection of the 
covariance function in response to data. 


> 45.4 Examples of covariance functions 


The only constraint on our choice of covariance function is that it must
generate a non-negative-definite covariance matrix for any set of points {x^{(n)}}_{n=1}^{N}.

We will denote the parameters of a covariance function by \theta. The covariance
matrix of t has entries given by


C_{mn} = C(x^{(m)}, x^{(n)}; \theta) + \delta_{mn} \mathcal{N}(x^{(n)}; \theta) (45.44)


where C is the covariance function and \mathcal{N} is a noise model which might be
stationary or spatially varying, for example,


\mathcal{N}(x; \theta) = \begin{cases} \theta_3 & for input-independent noise \\ \exp ( \sum_{j=1}^{J} \beta_j \phi_j(x) ) & for input-dependent noise. \end{cases} (45.45)


The continuity properties of C determine the continuity properties of typical 
samples from the Gaussian process prior. An encyclopaedic paper on Gaus- 
sian processes giving many valid covariance functions has been written by 
Abrahamsen (1997). 


Stationary covariance functions 


A stationary covariance function is one that is translation invariant in that it 
satisfies 


C(x, x'; \theta) = D(x - x'; \theta) (45.46)


for some function D, i.e., the covariance is a function of separation only, also 
known as the autocovariance function. If additionally C depends only on the 
magnitude of the distance between x and x’ then the covariance function is said 
to be homogeneous. Stationary covariance functions may also be described in 
terms of the Fourier transform of the function D, which is known as the power 
spectrum of the Gaussian process. This Fourier transform is necessarily a 
positive function of frequency. One way of constructing a valid stationary 
covariance function is to invent a positive function of frequency and define D 
to be its inverse Fourier transform. 


Example 45.5. Let the power spectrum be a Gaussian function of frequency. 
Since the Fourier transform of a Gaussian is a Gaussian, the autoco- 
variance function corresponding to this power spectrum is a Gaussian 
function of separation. This argument rederives the covariance function 
we derived at equation (45.30). 


Generalizing slightly, a popular form for C with hyperparameters \theta =
(\theta_1, \theta_2, {r_i}) is


C(x, x'; \theta) = \theta_1 \exp [ -\frac{1}{2} \sum_{i=1}^{I} \frac{(x_i - x'_i)^2}{r_i^2} ] + \theta_2. (45.47)


x is an I-dimensional vector and r_i is a lengthscale associated with input x_i, the
lengthscale in direction i on which y is expected to vary significantly. A very
large lengthscale means that y is expected to be essentially a constant function
of that input. Such an input could be said to be irrelevant, as in the automatic
relevance determination method for neural networks (MacKay, 1994a; Neal,
1996). The \theta_1 hyperparameter defines the vertical scale of variations of a
typical function. The \theta_2 hyperparameter allows the whole function to be
offset away from zero by some unknown constant; to understand this term,
examine equation (45.25) and consider the basis function \phi(x) = 1.
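The effect of a very large lengthscale can be seen numerically. The following is a minimal sketch of the covariance function (45.47), with a hypothetical function name `ard_cov` and made-up hyperparameter values. Giving the second input an enormous lengthscale yields, to numerical precision, the same covariance matrix as dropping that input altogether.

```python
import numpy as np

def ard_cov(Xa, Xb, theta1, theta2, r):
    """Covariance (45.47); Xa is (Na, I), Xb is (Nb, I), r holds the I lengthscales."""
    diff = (Xa[:, None, :] - Xb[None, :, :]) / r      # scaled differences, (Na, Nb, I)
    return theta1 * np.exp(-0.5 * np.sum(diff**2, axis=-1)) + theta2

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(5, 2))               # five points in I = 2 dimensions

# A huge lengthscale on input 2 makes that input effectively irrelevant.
C_full = ard_cov(X, X, theta1=1.0, theta2=0.5, r=np.array([0.7, 1e6]))
C_first_only = ard_cov(X[:, :1], X[:, :1], theta1=1.0, theta2=0.5, r=np.array([0.7]))
print(np.abs(C_full - C_first_only).max())
```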
Another stationary covariance function is


C(x, x') = \exp(-|x - x'|^\nu), 0 < \nu \le 2. (45.48)


For \nu = 2, this is a special case of the previous covariance function. For
\nu \in (1, 2), the typical functions from this prior are smooth but not analytic
functions. For \nu \le 1 typical functions are continuous but not smooth.

A covariance function that models a function that is periodic with known
period \lambda_i in the ith input direction is


C(x, x'; \theta) = \theta_1 \exp [ -\frac{1}{2} \sum_i ( \frac{\sin ( \frac{\pi}{\lambda_i} (x_i - x'_i) )}{r_i} )^2 ]. (45.49)





Figure 45.1 shows some random samples drawn from Gaussian processes 
with a variety of different covariance functions. 


Nonstationary covariance functions 


The simplest nonstationary covariance function is the one corresponding to a
linear trend. Consider the plane y(x) = \sum_i w_i x_i + c. If the {w_i} and c have
Gaussian distributions with zero mean and variances \sigma_w^2 and \sigma_c^2 respectively
then the plane has a covariance function


C_{lin}(x, x'; {\sigma_w, \sigma_c}) = \sum_{i=1}^{I} \sigma_w^2 x_i x'_i + \sigma_c^2. (45.50)


An example of random sample functions incorporating the linear term can be 
seen in figure 45.1d. 


45.5 Adaptation of Gaussian process models 


Let us assume that a form of covariance function has been chosen, but that it
depends on undetermined hyperparameters \theta. We would like to ‘learn’ these
hyperparameters from the data. This learning process is equivalent to the
inference of the hyperparameters of a neural network, for example, weight
decay hyperparameters. It is a complexity-control problem, one that is solved
nicely by the Bayesian Occam’s razor.

[Figure 45.2. Multimodal likelihood functions for Gaussian processes. A data
set of five points is modelled with the simple covariance function (45.47), with
one hyperparameter \theta_3 controlling the noise variance. Panels a and b show the
most probable interpolant and its 1\sigma error bars when the hyperparameters \theta
are set to two different values that (locally) maximize the likelihood
P(t_N | X_N, \theta): (a) r_1 = 0.95, \theta_3 = 0.0; (b) r_1 = 3.5, \theta_3 = 3.0.
Panel c shows a contour plot of the likelihood as a function of r_1 and \theta_3,
with the two maxima shown by crosses. From Gibbs (1997).]

Ideally we would like to define a prior distribution on the hyperparameters 
and integrate over them in order to make our predictions, i.e., we would like 
to find 


P(t_{N+1} | x_{N+1}, D) = \int P(t_{N+1} | x_{N+1}, \theta, D) P(\theta | D) d\theta. (45.51)


But this integral is usually intractable. There are two approaches we can take.


1. We can approximate the integral by using the most probable values of
hyperparameters.


P(t_{N+1} | x_{N+1}, D) \simeq P(t_{N+1} | x_{N+1}, D, \theta_{MP}) (45.52)


2. Or we can perform the integration over \theta numerically using Monte Carlo
methods (Williams and Rasmussen, 1996; Neal, 1997b).


Either of these approaches is implemented most efficiently if the gradient
of the posterior probability of \theta can be evaluated.


Gradient 
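The posterior probability of \theta is

P(\theta | D) \propto P(t_N | X_N, \theta) P(\theta). (45.53)


The log of the first term (the evidence for the hyperparameters) is

\ln P(t_N | X_N, \theta) = -\frac{1}{2} \ln \det C_N - \frac{1}{2} t_N^T C_N^{-1} t_N - \frac{N}{2} \ln 2\pi, (45.54)


and its derivative with respect to a hyperparameter \theta is


\frac{\partial}{\partial \theta} \ln P(t_N | X_N, \theta) = -\frac{1}{2} Trace ( C_N^{-1} \frac{\partial C_N}{\partial \theta} ) + \frac{1}{2} t_N^T C_N^{-1} \frac{\partial C_N}{\partial \theta} C_N^{-1} t_N. (45.55)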
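Equations (45.54) and (45.55) can be sketched in a few lines and checked against each other. The example below makes several assumptions for illustration: the covariance is the Gaussian form (45.30) plus fixed input-independent noise, the only hyperparameter being adapted is the lengthscale r, and the function names and data are invented. The analytic gradient is compared with a central finite difference of the log evidence.

```python
import numpy as np

def cov_and_grad(x, r, theta1=1.0, sigma_nu=0.2):
    """C_N for covariance (45.30) plus noise, and its derivative with respect to r."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = theta1 * np.exp(-d2 / (4.0 * r**2))
    C = K + sigma_nu**2 * np.eye(len(x))
    dC_dr = K * d2 / (2.0 * r**3)            # derivative of exp(-d2 / (4 r^2)) w.r.t. r
    return C, dC_dr

def log_evidence(x, t, r):
    """Equation (45.54)."""
    C, _ = cov_and_grad(x, r)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * logdet - 0.5 * t @ np.linalg.solve(C, t) - 0.5 * len(x) * np.log(2 * np.pi)

def log_evidence_grad(x, t, r):
    """Equation (45.55) for the single hyperparameter r."""
    C, dC = cov_and_grad(x, r)
    Cinv_t = np.linalg.solve(C, t)
    return -0.5 * np.trace(np.linalg.solve(C, dC)) + 0.5 * Cinv_t @ dC @ Cinv_t

x = np.linspace(0.0, 4.0, 8)
t = np.sin(x)
r0, eps = 1.3, 1e-5

grad = log_evidence_grad(x, t, r0)
fd = (log_evidence(x, t, r0 + eps) - log_evidence(x, t, r0 - eps)) / (2 * eps)
print(grad, fd)
```

A gradient-based optimizer could use `log_evidence_grad` to search for a locally most probable r; as noted in the comments that follow, the evidence may have more than one maximum.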


Comments 


Assuming that finding the derivatives of the priors is straightforward, we can
now search for \theta_{MP}. However there are two problems that we need to be aware
of. Firstly, as illustrated in figure 45.2, the evidence may be multimodal.
Suitable priors and sensible optimization strategies often eliminate poor
optima. Secondly and perhaps most importantly the evaluation of the
gradient of the log likelihood requires the evaluation of C_N^{-1}. Any exact inversion
method (such as Cholesky decomposition, LU decomposition or Gauss-Jordan
elimination) has an associated computational cost that is of order N^3 and so
calculating gradients becomes time consuming for large training data sets.
Approximate methods for implementing the predictions (equations (45.42) and
(45.43)) and gradient computation (equation (45.55)) are an active research
area. One approach based on the ideas of Skilling (1993) makes
approximations to C^{-1} t and Trace C^{-1} using iterative methods with cost O(N^2) (Gibbs
and MacKay, 1996; Gibbs, 1997). Further references on this topic are given
at the end of the chapter.


> 45.6 Classification


Gaussian processes can be integrated into classification modelling once we 
identify a variable that can sensibly be given a Gaussian process prior. 

In a binary classification problem, we can define a quantity a_n = a(x^{(n)})
such that the probability that the class is 1 rather than 0 is


P(t_n = 1 | a_n) = \frac{1}{1 + e^{-a_n}}. (45.56)

Large positive values of a correspond to probabilities close to one; large
negative values of a define probabilities that are close to zero. In a
classification problem, we typically intend that the probability P(t_n = 1) should be a
smoothly varying function of x. We can embody this prior belief by defining
a(x) to have a Gaussian process prior.
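To make this prior concrete, the sketch below draws one function a(x) from a Gaussian process, passes it through the logistic function (45.56), and samples class labels from the resulting probabilities. It illustrates the generative model only, not inference. The covariance function, its hyperparameters, and the small diagonal 'jitter' term added for numerical stability are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(4)

def gaussian_cov(xa, xb, theta1=4.0, r=1.0):
    """Covariance function (45.30), here used as the prior on a(x)."""
    return theta1 * np.exp(-((xa[:, None] - xb[None, :]) ** 2) / (4.0 * r**2))

x = np.linspace(-5.0, 5.0, 40)
K = gaussian_cov(x, x) + 1e-6 * np.eye(len(x))   # jitter keeps the Cholesky factor stable

a = np.linalg.cholesky(K) @ rng.standard_normal(len(x))   # one draw of a(x) from the prior
p = 1.0 / (1.0 + np.exp(-a))                              # (45.56): P(t_n = 1 | a_n)
t = (rng.uniform(size=len(x)) < p).astype(int)            # labels sampled with probability p
print(np.round(p[:5], 3), t[:5])
```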


Implementation 


It is not so easy to perform inferences and adapt the Gaussian process model
to data in a classification model as in regression problems because the
likelihood function (45.56) is not a Gaussian function of a_n. So the posterior
distribution of a given some observations t is not Gaussian and the
normalization constant P(t_N | X_N) cannot be written down analytically. Barber and
Williams (1997) have implemented classifiers based on Gaussian process priors
using Laplace approximations (Chapter 27). Neal (1997b) has implemented a
Monte Carlo approach to implementing a Gaussian process classifier. Gibbs
and MacKay (2000) have implemented another cheap and cheerful approach
based on the methods of Jaakkola and Jordan (section 33.8). In this
variational Gaussian process classifier, we obtain tractable upper and lower bounds
for the unnormalized posterior density over a, P(t_N | a) P(a). These bounds
are parameterized by variational parameters which are adjusted in order to
obtain the tightest possible fit. Using normalized versions of the optimized
bounds we then compute approximations to the predictive distributions.

Multi-class classification problems can also be solved with Monte Carlo 
methods (Neal, 1997b) and variational methods (Gibbs, 1997). 


> 45.7 Discussion 


Gaussian processes are moderately simple to implement and use. Because very 
few parameters of the model need to be determined by hand (generally only 
the priors on the hyperparameters), Gaussian processes are useful tools for 
automated tasks where fine tuning for each problem is not possible. We do 
not appear to sacrifice any performance for this simplicity. 

It is easy to construct Gaussian processes that have particular desired 
properties; for example we can make a straightforward automatic relevance 
determination model. 

One obvious problem with Gaussian processes is the computational cost 
associated with inverting an N x N matrix. The cost of direct methods of 
inversion becomes prohibitive when the number of data points N is greater 
than about 1000. 


Have we thrown the baby out with the bath water? 


According to the hype of 1987, neural networks were meant to be intelligent
models that discovered features and patterns in data. Gaussian processes in
contrast are simply smoothing devices. How can Gaussian processes
possibly replace neural networks? Were neural networks over-hyped, or have we
underestimated the power of smoothing methods?

I think both these propositions are true. The success of Gaussian processes 
shows that many real-world data modelling problems are perfectly well solved 
by sensible smoothing methods. The most interesting problems, the task of 
feature discovery for example, are not ones that Gaussian processes will solve. 
But maybe multilayer perceptrons can’t solve them either. Perhaps a fresh 
start is needed, approaching the problem of machine learning from a paradigm 
different from the supervised feedforward mapping. 


Further reading 


The study of Gaussian processes for regression is far from new. Time series 
analysis was being performed by the astronomer T.N. Thiele using Gaussian 
processes in 1880 (Lauritzen, 1981). In the 1940s, Wiener—Kolmogorov pre- 
diction theory was introduced for prediction of trajectories of military targets 
(Wiener, 1948). Within the geostatistics field, Matheron (1963) proposed a 
framework for regression using optimal linear estimators which he called ‘krig- 
ing’ after D.G. Krige, a South African mining engineer. This framework is 
identical to the Gaussian process approach to regression. Kriging has been 
developed considerably in the last thirty years (see Cressie (1993) for a re- 
view) including several Bayesian treatments (Omre, 1987; Kitanidis, 1986). 
However the geostatistics approach to the Gaussian process model has con- 
centrated mainly on low-dimensional problems and has largely ignored any 
probabilistic interpretation of the model. Kalman filters are widely used to 
implement inferences for stationary one-dimensional Gaussian processes, and 
are popular models for speech and music modelling (Bar-Shalom and Fort- 
mann, 1988). Generalized radial basis functions (Poggio and Girosi, 1989), 
ARMA models (Wahba, 1990) and variable metric kernel methods (Lowe, 
1995) are all closely related to Gaussian processes. See also O’Hagan (1978). 

The idea of replacing supervised neural networks by Gaussian processes 
was first explored by Williams and Rasmussen (1996) and Neal (1997b). A 
thorough comparison of Gaussian processes with other methods such as neural 
networks and MARS was made by Rasmussen (1996). Methods for reducing 
the complexity of data modelling with Gaussian processes remain an active 
research area (Poggio and Girosi, 1990; Luo and Wahba, 1997; Tresp, 2000; 
Williams and Seeger, 2001; Smola and Bartlett, 2001; Rasmussen, 2002; Seeger 
et al., 2003; Opper and Winther, 2000). 

A longer review of Gaussian processes is in (MacKay, 1998b). A review 
paper on regression with complexity control using hierarchical Bayesian models 
is (MacKay, 1992a). 

Gaussian processes and support vector learning machines (Schölkopf et al.,
1995; Vapnik, 1995) have a lot in common. Both are kernel-based predictors,
the kernel being another name for the covariance function. A Bayesian version
of support vectors, exploiting this connection, can be found in (Chu et al.,
2001; Chu et al., 2002; Chu et al., 2003b; Chu et al., 2003a).


46


Deconvolution


> 46.1 Traditional image reconstruction methods


Optimal linear filters 


In many imaging problems, the data measurements {d_n} are linearly related
to the underlying image f:


d_n = \sum_k R_{nk} f_k + n_n. (46.1)


The vector n denotes the inevitable noise that corrupts real data. In the case 
of a camera which produces a blurred picture, the vector f denotes the true 
image, d denotes the blurred and noisy picture, and the linear operator R 
is a convolution defined by the point spread function of the camera. In this 
special case, the true image and the data vector reside in the same space; 
but it is important to maintain a distinction between them. We will use the 
subscript n = 1,...,N to run over data measurements, and the subscripts 
k,k' = 1,..., K to run over image pixels. 

One might speculate that since the blur was created by a linear operation, 
then perhaps it might be deblurred by another linear operation. We can derive 
the optimal linear filter in two ways. 


Bayesian derivation 


We assume that the linear operator R is known, and that the noise n is
Gaussian and independent, with a known standard deviation \sigma_\nu.


P(d | f, \sigma_\nu, H) = \frac{1}{(2\pi\sigma_\nu^2)^{N/2}} \exp ( -\sum_n (d_n - \sum_k R_{nk} f_k)^2 / (2\sigma_\nu^2) ). (46.2)


We assume that the prior probability of the image is also Gaussian, with a
scale parameter \sigma_f.


P(f | \sigma_f, H) = \frac{\det^{1/2} C}{(2\pi\sigma_f^2)^{K/2}} \exp ( -\sum_{k,k'} f_k C_{kk'} f_{k'} / (2\sigma_f^2) ). (46.3)


If we assume no correlations among the pixels then the symmetric, full rank
matrix C is equal to the identity matrix I. The more sophisticated ‘intrinsic
correlation function’ model uses C = [G G^T]^{-1}, where G is a convolution that
takes us from an imaginary ‘hidden’ image, which is uncorrelated, to the real
correlated image. The intrinsic correlation function should not be confused
with the point spread function R which defines the image-to-data mapping.

A zero-mean Gaussian prior is clearly a poor assumption if it is known that
all elements of the image f are positive, but let us proceed. We can now write
down the posterior probability of an image f given the data d.


P(f | d, \sigma_\nu, \sigma_f, H) = \frac{P(d | f, \sigma_\nu, H) P(f | \sigma_f, H)}{P(d | \sigma_\nu, \sigma_f, H)}. (46.4)


In words,

Posterior = \frac{Likelihood \times Prior}{Evidence}. (46.5)


The ‘evidence’ P(d | \sigma_\nu, \sigma_f, H) is the normalizing constant for this posterior
distribution. Here it is unimportant, but it is used in a more sophisticated
analysis to compare, for example, different values of \sigma_\nu and \sigma_f, or different
point spread functions R.

Since the posterior distribution is the product of two Gaussian functions of
f, it is also a Gaussian, and can therefore be summarized by its mean, which
is also the most probable image, f_{MP}, and its covariance matrix:


\Sigma_{f|d} = [ -\nabla\nabla \log P(f | d, \sigma_\nu, \sigma_f, H) ]^{-1}, (46.6)


which defines the joint error bars on f. In this equation, the symbol \nabla denotes
differentiation with respect to the image parameters f. We can find f_{MP} by
differentiating the log of the posterior, and solving for the derivative being
zero. We obtain:


f_{MP} = [ R^T R + \frac{\sigma_\nu^2}{\sigma_f^2} C ]^{-1} R^T d. (46.7)


The operator [ R^T R + \frac{\sigma_\nu^2}{\sigma_f^2} C ]^{-1} R^T is called the optimal linear filter. When
the term \frac{\sigma_\nu^2}{\sigma_f^2} C can be neglected, the optimal linear filter is the pseudoinverse
[R^T R]^{-1} R^T. The term \frac{\sigma_\nu^2}{\sigma_f^2} C regularizes this ill-conditioned inverse.
The optimal linear filter can also be manipulated into the form:

Optimal linear filter = C^{-1} R^T [ R C^{-1} R^T + \frac{\sigma_\nu^2}{\sigma_f^2} I ]^{-1}. (46.8)
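The equivalence of the two forms of the optimal linear filter can be confirmed numerically. The following sketch assumes a hypothetical one-dimensional 'camera' whose point spread function is a normalized Gaussian blur applied circularly, takes C = I (no assumed correlations among pixels), and uses made-up noise and scale values. It builds the filter in the form of (46.7) and in the form of (46.8), and applies it to blurred, noisy data generated according to (46.1).

```python
import numpy as np

rng = np.random.default_rng(5)

K = 30                                   # number of image pixels (here N = K)
sigma_nu, sigma_f = 0.05, 1.0

# Hypothetical point spread function: a normalized Gaussian blur, applied circularly.
offsets = np.arange(K)
dist = np.minimum(offsets, K - offsets)
psf = np.exp(-0.5 * (dist / 1.5) ** 2)
psf /= psf.sum()
R = np.stack([np.roll(psf, n) for n in range(K)])    # R[n, k] = psf[(k - n) mod K]

C = np.eye(K)                            # no assumed correlations among pixels
ratio = sigma_nu**2 / sigma_f**2

# The filter in the form of (46.7) and in the equivalent form (46.8).
W_a = np.linalg.solve(R.T @ R + ratio * C, R.T)
C_inv = np.linalg.inv(C)
W_b = C_inv @ R.T @ np.linalg.inv(R @ C_inv @ R.T + ratio * np.eye(K))

# Simulate data from a sparse true image, equation (46.1), and reconstruct it.
f_true = np.zeros(K)
f_true[[8, 20]] = [1.0, 0.6]
d = R @ f_true + sigma_nu * rng.standard_normal(K)
f_mp = W_a @ d
print(np.abs(W_a - W_b).max(), f_mp.min())
```

The reconstruction `f_mp` typically dips below zero near the two bright pixels, since nothing in this Gaussian model constrains the image to be positive.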





Minimum square error derivation 


The non-Bayesian derivation of the optimal linear filter starts by assuming
that we will ‘estimate’ the true image f by a linear function of the data:


\hat{f} = W d. (46.9)


The linear operator W is then ‘optimized’ by minimizing the expected
sum-squared error between \hat{f} and the unknown true image f. In the following
equations, summations over repeated indices k, k', n are implicit. The expectation
\langle \cdot \rangle is over both the statistics of the random variables {n_n}, and the ensemble
of images f which we expect to bump into. We assume that the noise is zero
mean and uncorrelated to second order with itself and everything else, with
\langle n_n n_{n'} \rangle = \sigma_\nu^2 \delta_{nn'}.


\langle E \rangle = \frac{1}{2} \langle (W_{kn} d_n - f_k)^2 \rangle (46.10)
                  = \frac{1}{2} \langle (W_{kn} R_{nj} f_j - f_k)^2 \rangle + \frac{1}{2} W_{kn} W_{kn} \sigma_\nu^2. (46.11)


Differentiating with respect to W, and introducing F = \langle f_{j'} f_j \rangle (cf. \sigma_f^2 C^{-1} in
the Bayesian derivation above), we find that the optimal linear filter is:


W_{opt} = F R^T [ R F R^T + \sigma_\nu^2 I ]^{-1}. (46.12)


If we identify F = \sigma_f^2 C^{-1}, we obtain the optimal linear filter (46.8) of the
Bayesian derivation. The ad hoc assumptions made in this derivation were the
choice of a quadratic error measure, and the decision to use a linear estimator.
It is interesting that without explicit assumptions of Gaussian distributions,
this derivation has reproduced the same estimator as the Bayesian posterior
mode, f_{MP}.

The advantage of a Bayesian approach is that we can criticize these as- 
sumptions and modify them in order to make better reconstructions. 


Other image models 


The better matched our model of images P(f | H) is to the real world, the bet- 
ter our image reconstructions will be, and the less data we will need to answer 
any given question. The Gaussian models which lead to the optimal linear 
filter are spectacularly poorly matched to the real world. For example, the 
Gaussian prior (46.3) fails to specify that all pixel intensities in an image are 
positive. This omission leads to the most pronounced artefacts where the im- 
age under observation has high contrast or large black patches. Optimal linear 
filters applied to astronomical data give reconstructions with negative areas in 
them, corresponding to patches of sky that suck energy out of telescopes! The 
maximum entropy model for image deconvolution (Gull and Daniell, 1978) 
was a great success principally because this model forced the reconstructed 
image to be positive. The spurious negative areas and complementary spu- 
rious positive areas are eliminated, and the quality of the reconstruction is 
greatly enhanced. 
The ‘classic maximum entropy’ model assigns an entropic prior 


$$ P(f \mid \alpha, m, H_{\text{Classic}}) = \exp(\alpha S(f, m))/Z , \tag{46.13} $$

where

$$ S(f, m) = \sum_i \left( f_i \ln(m_i/f_i) + f_i - m_i \right) \tag{46.14} $$

(Skilling, 1989). This model enforces positivity; the parameter α defines a
characteristic dynamic range by which the pixel values are expected to differ
from the default image m.
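
To get a feel for the entropic prior, here is a small sketch of the function S(f, m) of equation (46.14). The function name and the test images are made up for illustration. Because $\ln x \le x - 1$, every term in the sum is at most zero, so S is largest (and equal to zero) when the image equals the default image.

```python
import numpy as np

def entropy_S(f, m):
    """S(f, m) = sum_i [ f_i ln(m_i / f_i) + f_i - m_i ], for strictly positive f and m."""
    f = np.asarray(f, dtype=float)
    m = np.asarray(m, dtype=float)
    return float(np.sum(f * np.log(m / f) + f - m))

m = np.ones(4)                                  # flat default image (assumed)
S_default = entropy_S(m, m)                     # image equal to the default
S_other = entropy_S([0.2, 1.0, 3.0, 0.7], m)    # any other positive image
print(S_default, S_other)
```

The prior (46.13) therefore favours images close to m, and the logarithm means the function is only defined for positive pixel values.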

The ‘intrinsic-correlation-function maximum-entropy’ model (Gull, 1989) 
introduces an expectation of spatial correlations into the prior on f by writing 
f = Gh, where G is a convolution with an intrinsic correlation function, and 
putting a classic maxent prior on the underlying hidden image h. 


Probabilistic movies 


Having found not only the most probable image $f_{\text{MP}}$ but also error bars on
it, $\Sigma_{f|d}$, one task is to visualize those error bars. Whether or not we use
Monte Carlo methods to infer f, a correlated random walk around the posterior
distribution can be used to visualize the uncertainties and correlations. For
a Gaussian posterior distribution, we can create a correlated sequence of unit
normal random vectors n using

$$ n^{(t+1)} = c\, n^{(t)} + s z , \tag{46.15} $$

where z is a unit normal random vector and $c^2 + s^2 = 1$ (c controls how
persistent the memory of the sequence is). We then render the image sequence
defined by

$$ f^{(t)} = f_{\text{MP}} + \Sigma_{f|d}^{1/2}\, n^{(t)} , \tag{46.16} $$

where $\Sigma_{f|d}^{1/2}$ is the Cholesky decomposition of $\Sigma_{f|d}$.
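
The recipe of equations (46.15) and (46.16) can be sketched in a few lines. This is an illustration only: the function name, the default c = 0.95, and the small random covariance used in the example are assumptions, and the rendering of each frame as a picture is left out.

```python
import numpy as np

def probabilistic_movie(f_mp, Sigma, n_frames, c=0.95, seed=0):
    """Return n_frames correlated samples around f_mp with posterior covariance Sigma."""
    rng = np.random.default_rng(seed)
    s = np.sqrt(1.0 - c**2)                 # enforces c^2 + s^2 = 1
    L = np.linalg.cholesky(Sigma)           # lower-triangular factor, Sigma = L L^T
    n = rng.standard_normal(f_mp.shape)     # unit normal starting vector
    frames = []
    for _ in range(n_frames):
        frames.append(f_mp + L @ n)                       # equation (46.16)
        n = c * n + s * rng.standard_normal(f_mp.shape)   # equation (46.15)
    return np.array(frames)

# Tiny made-up example: a 4-pixel 'image' with a random positive-definite covariance.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
Sigma = A @ A.T + np.eye(4)
f_mp = np.zeros(4)
frames = probabilistic_movie(f_mp, Sigma, n_frames=5)
frozen = probabilistic_movie(f_mp, Sigma, n_frames=5, c=1.0)   # s = 0: no motion
```

Since c² + s² = 1, each n stays unit normal, so every frame is a fair sample from the Gaussian posterior; values of c close to 1 give a slowly drifting movie, and c = 0 gives independent samples.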


> 46.2 Supervised neural networks for image deconvolution 


Neural network researchers often exploit the following strategy. Given a prob- 
lem currently solved with a standard algorithm: interpret the computations 
performed by the algorithm as a parameterized mapping from an input to an 
output, and call this mapping a neural network; then adapt the parameters 
to data so as to produce another mapping that solves the task better. By 
construction, the neural network can reproduce the standard algorithm, so 
this data-driven adaptation can only make the performance better. 

There are several reasons why standard algorithms can be bettered in this 
way. 


1. Algorithms are often not designed to optimize the real objective func- 
tion. For example, in speech recognition, a hidden Markov model is 
designed to model the speech signal, and is fitted so as to maximize
the generative probability given the known string of words in the training 
data; but the real objective is to discriminate between different words. 
If an inadequate model is being used, the neural-net-style training of 
the model will focus the limited resources of the model on the aspects 
relevant to the discrimination task. Discriminative training of hidden 
Markov models for speech recognition does improve their performance. 


2. The neural network can be more flexible than the standard model; some 
of the adaptive parameters might have been viewed as fixed features by 
the original designers. A flexible network can find properties in the data 
that were not included in the original model. 


> 46.3 Deconvolution in humans 


A huge fraction of our brain is devoted to vision. One of the neglected features 
of our visual system is that the raw image falling on the retina is severely 
blurred: while most people can see with a resolution of about 1 arcminute 
(one sixtieth of a degree) under any daylight conditions, bright or dim, the 
image on our retina is blurred through a point spread function of width as 
large as 5 arcminutes (Wald and Griffin, 1947; Howarth and Bradley, 1986). 
It is amazing that we are able to resolve pixels that are twenty-five times 
smaller in area than the blob produced on our retina by any point source. 
Isaac Newton was aware of this conundrum. It’s hard to make a lens that 
does not have chromatic aberration, and our cornea and lens, like a lens made 
of ordinary glass, refract blue light more strongly than red. Typically our eyes 
focus correctly for the middle of the visible spectrum (green), so if we look 
at a single white dot made of red, green, and blue light, the image on our 
retina consists of a sharply focussed green dot surrounded by a broader red 
blob superposed on an even broader blue blob. The width of the red and blue 
blobs is proportional to the diameter of the pupil, which is largest under dim 
lighting conditions. [The blobs are roughly concentric, though most people 
have a slight bias, such that in one eye the red blob is centred a tiny distance
to the left and the blue is centred a tiny distance to the right, and in the other
eye it’s the other way round. This slight bias explains why when we look 
at blue and red writing on a dark background most people perceive the blue 
writing to be at a slightly greater depth than the red. In a minority of people, 
this small bias is the other way round and the red/blue depth perception is 
reversed. But this effect (which many people are aware of, having noticed it 
in cinemas, for example) is tiny compared with the chromatic aberration we 
are discussing.]

You can vividly demonstrate to yourself how enormous the chromatic aber- 
ration in your eye is with the help of a sheet of card and a colour computer 
screen. 

For the most impressive results — I guarantee you will be amazed — use 
a dim room with no light apart from the computer screen; a pretty strong 
effect will still be seen even if the room has daylight coming into it, as long as 
it is not bright sunshine. Cut a slit about 1.5 mm wide in the card. On the
screen, display a few small coloured objects on a black background. I especially 
recommend thin vertical objects coloured pure red, pure blue, magenta (i.e., 
red plus blue), and white (red plus blue plus green). Include a little
black-and-white text on the screen too. Stand or sit sufficiently far away that you
can only just read the text — perhaps a distance of four metres or so, if you have 
normal vision. Now, hold the slit vertically in front of one of your eyes, and 
close the other eye. Hold the slit near to your eye — brushing your eyelashes — 
and look through it. Waggle the slit slowly to the left and to the right, so that 
the slit is alternately in front of the left and right sides of your pupil. What 
do you see? I see the red objects waggling to and fro, and the blue objects 
waggling to and fro, through huge distances and in opposite directions, while 
white objects appear to stay still and are negligibly distorted. Thin magenta 
objects can be seen splitting into their constituent red and blue parts. Measure 
how large the motion of the red and blue objects is — it’s more than 5 minutes 
of arc for me, in a dim room. Then check how sharply you can see under these 
conditions — look at the text on the screen, for example: is it not the case that 
you can see (through your whole pupil) features far smaller than the distance 
through which the red and blue components were waggling? Yet when you are 
using the whole pupil, what is falling on your retina must be an image blurred 
with a blurring diameter equal to the waggling amplitude. 

One of the main functions of early visual processing must be to deconvolve 
this chromatic aberration. Neuroscientists sometimes conjecture that the rea- 
son why retinal ganglion cells and cells in the lateral geniculate nucleus (the 
main brain area to which retinal ganglion cells project) have centre-surround 
receptive fields with colour opponency (long wavelength in the centre and 
medium wavelength in the surround, for example) is in order to perform ‘fea- 
ture extraction’ or ‘edge detection’, but I think this view is mistaken. The 
reason we have centre-surround filters at the first stage of visual processing 
(in the fovea at least) is for the huge task of deconvolution of chromatic aber- 
ration. 

I speculate that the McCollough effect, an extremely long-lasting associ- 
ation of colours with orientation (McCollough, 1965; MacKay and MacKay, 
1974), is produced by the adaptation mechanism that tunes our chromatic- 
aberration-deconvolution circuits. Our deconvolution circuits need to be rapidly 
tuneable, because the point spread function of our eye changes with our pupil 
diameter, which can change within seconds; and indeed the McCollough effect 
can be induced within 30 seconds. At the same time, the effect is long-lasting
when an eye is covered, because it’s in our interests that our deconvolution
circuits should stay well-tuned while we sleep, so that we can see sharply the 
instant we wake up. 

I also wonder whether the main reason that we evolved colour vision was 
not ‘in order to see fruit better’ but ‘so as to be able to see black and white 
sharper’—deconvolving chromatic aberration is easier, even in an entirely black 
and white world, if one has access to chromatic information in the image. 

And a final speculation: why do our eyes make micro-saccades when we 
look at things? These miniature eye-movements are of an angular size big- 
ger than the spacing between the cones in the fovea (which are spaced at 
roughly 1 minute of arc, the perceived resolution of the eye). The typical 
size of a microsaccade is 5-10 minutes of arc (Ratliff and Riggs, 1950). Is it a 
coincidence that this is the same as the size of chromatic aberration? Surely 
micro-saccades must play an essential role in the deconvolution mechanism 
that delivers our high-resolution vision. 


> 46.4 Exercises 


Exercise 46.1.[°] Blur an image with a circular (top hat) point spread func- 
tion and add noise. Then deconvolve the blurry noisy image using the 
optimal linear filter. Find error bars and visualize them by making a 
probabilistic movie.


About Part VI


The central problem of communication theory is to construct an encoding 
and a decoding system that make it possible to communicate reliably over 
a noisy channel. During the 1990s, remarkable progress was made towards 
the Shannon limit, using codes that are defined in terms of sparse random 
graphs, and which are decoded by a simple probability-based message-passing 
algorithm. 

In a sparse-graph code, the nodes in the graph represent the transmitted 
bits and the constraints they satisfy. For a linear code with a codeword length 
N and rate R = K/N, the number of constraints is of order M = N − K.
Any linear code can be described by a graph, but what makes a sparse-graph 
code special is that each constraint involves only a small number of variables 
in the graph: so the number of edges in the graph scales roughly linearly with 
N, rather than quadratically. 

In the following four chapters we will look at four families of sparse-graph 
codes: three families that are excellent for error-correction: low-density parity- 
check codes, turbo codes, and repeat—accumulate codes; and the family of 
digital fountain codes, which are outstanding for erasure-correction. 

All these codes can be decoded by a local message-passing algorithm on the 
graph, the sum—product algorithm, and, while this algorithm is not a perfect 
maximum likelihood decoder, the empirical results are record-breaking.


47


Low-Density Parity-Check Codes 





A low-density parity-check code (or Gallager code) is a block code that has a
parity-check matrix, H, every row and column of which is ‘sparse’.

A regular Gallager code is a low-density parity-check code in which every 
column of H has the same weight j and every row has the same weight k; reg- 
ular Gallager codes are constructed at random subject to these constraints. A 
low-density parity-check code with j = 3 and k = 4 is illustrated in figure 47.1. 
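
One simple way to build such a matrix is sketched below. It is an assumption about how the random construction might be done, not a description of how the matrix in the figure was made: it stacks j column-permuted copies of a 'staircase' block in which every row has k ones, in the spirit of Gallager's original construction. The function name `gallager_matrix` is hypothetical.

```python
import numpy as np

def gallager_matrix(N, j, k, seed=0):
    """Random (N j / k) x N binary matrix with column weight j and row weight k."""
    assert N % k == 0, "blocklength must be a multiple of the row weight"
    rng = np.random.default_rng(seed)
    rows = N // k
    base = np.zeros((rows, N), dtype=int)
    for i in range(rows):
        base[i, i * k:(i + 1) * k] = 1       # row i checks bits i*k .. i*k + k - 1
    # Each permuted copy contributes exactly one 1 to every column.
    blocks = [base[:, rng.permutation(N)] for _ in range(j)]
    return np.vstack(blocks)

H = gallager_matrix(N=16, j=3, k=4)          # same shape as the example above: M = 12
print(H.shape, H.sum(axis=0), H.sum(axis=1))
```

This sketch makes no attempt to avoid short cycles in the graph or to guarantee that the rows of H are linearly independent, both of which matter in a serious construction.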






























































































Figure 47.1. A low-density parity-check matrix and the corresponding graph of a
rate-1/4 low-density parity-check code with blocklength N = 16, and M = 12
constraints. Each white circle represents a transmitted bit. Each bit participates
in j = 3 constraints, represented by [+] squares. Each constraint forces the sum of
the k = 4 bits to which it is connected to be even.


> 47.1 Theoretical properties


Low-density parity-check codes lend themselves to theoretical study. The
following results are proved in Gallager (1963) and MacKay (1999b).
Low-density parity-check codes, in spite of their simple construction, are
good codes, given an optimal decoder (good codes in the sense of section 11.4).
Furthermore, they have good distance (in the sense of section 13.2). These two
results hold for any column weight j ≥ 3. Furthermore, there are sequences of
low-density parity-check codes in which j increases gradually with N, in such
a way that the ratio j/N still goes to zero, that are very good, and that have
very good distance.
However, we don’t have an optimal decoder, and decoding low-density
parity-check codes is an NP-complete problem. So what can we do in practice?














> 47.2 Practical decoding 


Given a channel output r, we wish to find the codeword t whose likelihood 
P(r|t) is biggest. All the effective decoding strategies for low-density parity- 
check codes are message-passing algorithms. The best algorithm known is 
the sum-product algorithm, also known as iterative probabilistic decoding or 
belief propagation. 

We'll assume that the channel is a memoryless channel (though more com- 
plex channels can easily be handled by running the sum—product algorithm 
on a more complex graph that represents the expected correlations among the 
errors (Worthen and Stark, 1998)). For any memoryless channel, there are 
two approaches to the decoding problem, both of which lead to the generic 
problem ‘find the x that maximizes 


P*(x) = P(x) 1[Hx = z]’, (47.1) 


where P(x) is a separable distribution on a binary vector x, and z is another 
binary vector. Each of these two approaches represents the decoding problem 
in terms of a factor graph (Chapter 26).


Figure 47.2. Factor graphs associated with a low-density parity-check code.
(a) The prior distribution over codewords, $P(t) \propto \mathbb{1}[Ht = 0]$. The variable
nodes are the transmitted bits $\{t_n\}$. Each [+] node represents the factor
$\mathbb{1}[\sum_{n \in \mathcal{N}(m)} t_n = 0 \bmod 2]$.
(b) The posterior distribution over codewords, $P(t \mid r) \propto P(t) P(r \mid t)$. Each
upper function node represents a likelihood factor $P(r_n \mid t_n)$.
(c) The joint probability of the noise n and syndrome z,
$P(n, z) = P(n)\, \mathbb{1}[z = Hn]$. The top variable nodes are now the noise bits $\{n_n\}$.
The added variable nodes at the base are the syndrome values $\{z_m\}$. Each
definition $z_m = \sum_n H_{mn} n_n \bmod 2$ is enforced by a [+] factor.


The codeword decoding viewpoint


First, we note that the prior distribution over codewords,

$$ P(t) \propto \mathbb{1}[Ht = 0 \bmod 2] , \tag{47.2} $$

can be represented by a factor graph (figure 47.2a), with the factorization
being

$$ P(t) \propto \prod_m \mathbb{1}\Big[ \sum_{n \in \mathcal{N}(m)} t_n = 0 \bmod 2 \Big] . \tag{47.3} $$

(We’ll omit the ‘mod 2’s from now on.) The posterior distribution over
codewords is given by multiplying this prior by the likelihood, which introduces
another N factors, one for each received bit.

$$ P(t \mid r) \propto P(t) P(r \mid t) \propto \prod_m \mathbb{1}\Big[ \sum_{n \in \mathcal{N}(m)} t_n = 0 \Big] \prod_n P(r_n \mid t_n) \tag{47.4} $$

The factor graph corresponding to this function is shown in figure 47.2b. It
is the same as the graph for the prior, except for the addition of likelihood
‘dongles’ to the transmitted bits.

In this viewpoint, the received signal $r_n$ can live in any alphabet; all that
matters are the values of $P(r_n \mid t_n)$.


The syndrome decoding viewpoint 


Alternatively, we can view the channel output in terms of a binary received 
vector r and a noise vector n, with a probability distribution P(n) that can 
be derived from the channel properties and whatever additional information 
is available at the channel outputs. 

For example, with a binary symmetric channel, we define the noise by
r = t + n, the syndrome z = Hr, and noise model $P(n_n = 1) = f$. For other
channels such as the Gaussian channel with output y, we may define a received
binary vector r however we wish and obtain an effective binary noise model
P(n) from y (exercises 9.18 (p.155) and 25.1 (p.325)).
The joint probability of the noise n and syndrome z = Hn can be factored as

$$ P(n, z) = P(n)\, \mathbb{1}[z = Hn] = \prod_n P(n_n) \prod_m \mathbb{1}\Big[ z_m = \sum_{n \in \mathcal{N}(m)} n_n \Big] . \tag{47.5} $$

The factor graph of this function is shown in figure 47.2c. The variables n
and z can also be drawn in a ‘belief network’ (also known as a ‘Bayesian
network’, ‘causal network’, or ‘influence diagram’) similar to figure 47.2a, but
with arrows on the edges from the upper circular nodes (which represent the
variables n) to the lower square nodes (which now represent the variables z).
We can say that every bit $x_n$ is the parent of j checks $z_m$, and each check $z_m$
is the child of k bits.

Both decoding viewpoints involve essentially the same graph. Either ver- 
sion of the decoding problem can be expressed as the generic decoding problem 
‘find the x that maximizes 


P*(x) = P(x) 1[Hx=z]’; (47.6) 


in the codeword decoding viewpoint, x is the codeword t, and z is 0; in the 
syndrome decoding viewpoint, x is the noise n, and z is the syndrome. 

It doesn’t matter which viewpoint we take when we apply the sum—product 
algorithm. The two decoding algorithms are isomorphic and will give equiva- 
lent outcomes (unless numerical errors intervene). 


I tend to use the syndrome decoding viewpoint because it has one advantage: 
one does not need to implement an encoder for a code in order to be able to 
simulate a decoding problem realistically. 
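
The reason no encoder is needed is that the syndrome depends only on the noise: since Ht = 0 for every codeword, Hr = H(t + n) = Hn. A minimal check is sketched below. For brevity it uses the parity-check matrix of the (7, 4) Hamming code as a stand-in (it is not a low-density matrix), and the particular codeword and noise vector are arbitrary choices.

```python
import numpy as np

# Stand-in parity-check matrix: the (7, 4) Hamming code, not a sparse code.
H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])

t = np.array([1, 0, 0, 0, 1, 0, 1])    # a codeword, so H t = 0 (mod 2)
n = np.array([0, 1, 0, 0, 0, 0, 0])    # noise vector that flips the second bit
r = (t + n) % 2                        # received vector on a binary symmetric channel

z_from_r = (H @ r) % 2                 # syndrome computed from what was received
z_from_n = (H @ n) % 2                 # syndrome computed from the noise alone
print(z_from_r, z_from_n)
```

To simulate decoding one can therefore draw n from the noise model, set z = Hn, and ask the decoder to recover n, without ever constructing t.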


We’ll now talk in terms of the generic decoding problem. 


> 47.3 Decoding with the sum—product algorithm 


We aim, given the observed checks, to compute the marginal posterior probabilities
$P(x_n = 1 \mid z, H)$ for each n. It is hard to compute these exactly because
the graph contains many cycles. However, it is interesting to implement the 
decoding algorithm that would be appropriate if there were no cycles, on the 
assumption that the errors introduced might be relatively small. This ap- 
proach of ignoring cycles has been used in the artificial intelligence literature 
but is now frowned upon because it produces inaccurate probabilities. How- 
ever, if we are decoding a good error-correcting code, we don’t care about 
accurate marginal probabilities — we just want the correct codeword. Also, 
the posterior probability, in the case of a good code communicating at an 
achievable rate, is expected typically to be hugely concentrated on the most 
probable decoding; so we are dealing with a distinctive probability distribution 
to which experience gained in other fields may not apply. 

The sum—product algorithm was presented in Chapter 26. We now write 
out explicitly how it works for solving the decoding problem 


Hx = z (mod 2).


For brevity, we reabsorb the dongles hanging off the x and z nodes in
figure 47.2c and modify the sum–product algorithm accordingly. The graph in
which x and z live is then the original graph (figure 47.2a) whose edges are
defined by the 1s in H. The graph contains nodes of two types, which we'll
call checks and bits. The graph connecting the checks and bits is a bipartite
graph: bits connect only to checks, and vice versa. On each iteration, a
probability ratio is propagated along each edge in the graph, and each bit node $x_n$
updates its probability that it should be in state 1.

We denote the set of bits n that participate in check m by
$\mathcal{N}(m) \equiv \{n : H_{mn} = 1\}$. Similarly we define the set of checks in which bit n
participates, $\mathcal{M}(n) \equiv \{m : H_{mn} = 1\}$. We denote a set $\mathcal{N}(m)$ with bit n
excluded by $\mathcal{N}(m) \backslash n$. The algorithm has two alternating parts, in which
quantities $q_{mn}$ and $r_{mn}$ associated with each edge in the graph are iteratively
updated. The quantity $q^x_{mn}$ is meant to be the probability that bit n of x has
the value x, given the information obtained via checks other than check m. The
quantity $r^x_{mn}$ is meant to be the probability of check m being satisfied if bit n
of x is considered fixed at x and the other bits have a separable distribution
given by the probabilities $\{q_{mn'} : n' \in \mathcal{N}(m) \backslash n\}$. The algorithm would
produce the exact posterior probabilities of all the bits after a fixed number of
iterations if the bipartite graph defined by the matrix H contained no cycles.


Initialization. Let $p^0_n = P(x_n = 0)$ (the prior probability that bit $x_n$ is 0),
and let $p^1_n = P(x_n = 1) = 1 - p^0_n$. If we are taking the syndrome decoding
viewpoint and the channel is a binary symmetric channel then $p^1_n$ will equal
f. If the noise level varies in a known way (for example if the channel is a
binary-input Gaussian channel with a real output) then $p^1_n$ is initialized to the
appropriate normalized likelihood. For every (n, m) such that $H_{mn} = 1$ the
variables $q^0_{mn}$ and $q^1_{mn}$ are initialized to the values $p^0_n$ and $p^1_n$ respectively.





Horizontal step. In the horizontal step of the algorithm (horizontal from
the point of view of the matrix H), we run through the checks m and compute
for each $n \in \mathcal{N}(m)$ two probabilities: first, $r^0_{mn}$, the probability of the
observed value of $z_m$ arising when $x_n = 0$, given that the other bits
$\{x_{n'} : n' \neq n\}$ have a separable distribution given by the probabilities
$\{q^0_{mn'}, q^1_{mn'}\}$, defined by:

$$ r^0_{mn} = \sum_{\{x_{n'} :\, n' \in \mathcal{N}(m) \backslash n\}} P\big(z_m \mid x_n = 0, \{x_{n'} : n' \in \mathcal{N}(m) \backslash n\}\big) \prod_{n' \in \mathcal{N}(m) \backslash n} q^{x_{n'}}_{mn'} \tag{47.7} $$

and second, $r^1_{mn}$, the probability of the observed value of $z_m$ arising when
$x_n = 1$, defined by:

$$ r^1_{mn} = \sum_{\{x_{n'} :\, n' \in \mathcal{N}(m) \backslash n\}} P\big(z_m \mid x_n = 1, \{x_{n'} : n' \in \mathcal{N}(m) \backslash n\}\big) \prod_{n' \in \mathcal{N}(m) \backslash n} q^{x_{n'}}_{mn'} \tag{47.8} $$

The conditional probabilities in these summations are either zero or one,
depending on whether the observed $z_m$ matches the hypothesized values for $x_n$
and the $\{x_{n'}\}$.

These probabilities can be computed in various obvious ways based on
equations (47.7) and (47.8). The computations may be done most efficiently (if
$|\mathcal{N}(m)|$ is large) by regarding $z_m + x_n$ as the final state of a Markov chain with
states 0 and 1, this chain being started in state 0, and undergoing transitions
corresponding to additions of the various $x_{n'}$, with transition probabilities
given by the corresponding $q^0_{mn'}$ and $q^1_{mn'}$. The probabilities for $z_m$ having its
observed value given either $x_n = 0$ or $x_n = 1$ can then be found efficiently by
use of the forward–backward algorithm (section 25.3).

A particularly convenient implementation of this method uses forward and
backward passes in which products of the differences $\delta q_{mn} \equiv q^0_{mn} - q^1_{mn}$ are
computed. We obtain $\delta r_{mn} \equiv r^0_{mn} - r^1_{mn}$ from the identity:

$$ \delta r_{mn} = (-1)^{z_m} \prod_{n' \in \mathcal{N}(m) \backslash n} \delta q_{mn'} . \tag{47.9} $$

This identity is derived by iterating the following observation: if
$\zeta = x_\mu + x_\nu \bmod 2$, and $x_\mu$ and $x_\nu$ have probabilities $q^0_\mu, q^0_\nu$ and $q^1_\mu, q^1_\nu$ of
being 0 and 1, then $P(\zeta = 1) = q^1_\mu q^0_\nu + q^0_\mu q^1_\nu$ and
$P(\zeta = 0) = q^0_\mu q^0_\nu + q^1_\mu q^1_\nu$. Thus
$P(\zeta = 0) - P(\zeta = 1) = (q^0_\mu - q^1_\mu)(q^0_\nu - q^1_\nu)$.

We recover $r^0_{mn}$ and $r^1_{mn}$ using

$$ r^0_{mn} = \tfrac{1}{2}(1 + \delta r_{mn}), \qquad r^1_{mn} = \tfrac{1}{2}(1 - \delta r_{mn}) . \tag{47.10} $$

The transformations into differences δq and back from δr to {r} may be viewed
as a Fourier transform and an inverse Fourier transformation.
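
The difference identity can be checked against the brute-force sums (47.7) and (47.8) for a single check. The sketch below is only a sanity check; the helper names `r_brute` and `r_fast` and the example probabilities are made up for illustration.

```python
import itertools
import numpy as np

def r_brute(z_m, x_n, q1_others):
    """Sum over all settings of the other bits, as in (47.7) and (47.8)."""
    total = 0.0
    for xs in itertools.product([0, 1], repeat=len(q1_others)):
        if (x_n + sum(xs)) % 2 == z_m:           # conditional probability is 0 or 1
            prob = 1.0
            for x, q1 in zip(xs, q1_others):
                prob *= q1 if x == 1 else 1.0 - q1
            total += prob
    return total

def r_fast(z_m, q1_others):
    """Difference identity (47.9), then (47.10); returns (r0, r1)."""
    dq = 1.0 - 2.0 * np.asarray(q1_others)       # dq = q0 - q1
    dr = (-1.0) ** z_m * np.prod(dq)
    return 0.5 * (1.0 + dr), 0.5 * (1.0 - dr)

q1_others = [0.1, 0.3, 0.45]                     # P(x_n' = 1) for the other bits
max_err = 0.0
for z_m in (0, 1):
    r0, r1 = r_fast(z_m, q1_others)
    max_err = max(max_err,
                  abs(r0 - r_brute(z_m, 0, q1_others)),
                  abs(r1 - r_brute(z_m, 1, q1_others)))
print(max_err)
```

The brute-force sum has a cost that grows exponentially with the number of bits in the check, whereas the product of differences grows only linearly, which is why the difference form is the convenient one.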


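Vertical step. The vertical step takes the computed values of $r^0_{mn}$ and $r^1_{mn}$
and updates the values of the probabilities $q^0_{mn}$ and $q^1_{mn}$. For each n we
compute:

$$ q^0_{mn} = \alpha_{mn}\, p^0_n \prod_{m' \in \mathcal{M}(n) \backslash m} r^0_{m'n} \tag{47.11} $$
$$ q^1_{mn} = \alpha_{mn}\, p^1_n \prod_{m' \in \mathcal{M}(n) \backslash m} r^1_{m'n} \tag{47.12} $$

where $\alpha_{mn}$ is chosen such that $q^0_{mn} + q^1_{mn} = 1$. These products can be efficiently
computed in a downward pass and an upward pass.

We can also compute the ‘pseudoposterior probabilities’ $q^0_n$ and $q^1_n$ at this
iteration, given by:

$$ q^0_n = \alpha_n\, p^0_n \prod_{m \in \mathcal{M}(n)} r^0_{mn} , \tag{47.13} $$
$$ q^1_n = \alpha_n\, p^1_n \prod_{m \in \mathcal{M}(n)} r^1_{mn} . \tag{47.14} $$

These quantities are used to create a tentative decoding $\hat{x}$, the consistency
of which is used to decide whether the decoding algorithm can halt. (Halt if
$H\hat{x} = z$.)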

At this point, the algorithm repeats from the horizontal step. 
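
The initialization, horizontal step, vertical step and halting test described above can be put together as in the following sketch. It is a plain, unoptimized illustration rather than a reference implementation: it stores the messages in dense arrays, recomputes each product from scratch instead of using forward and backward passes, and takes no precautions against messages saturating at exactly 0 or 1. The function name, its arguments and the iteration limit are assumptions.

```python
import numpy as np

def sum_product_decode(H, z, p1, max_iter=200):
    """Look for x with Hx = z (mod 2), given priors p1[n] = P(x_n = 1).

    Returns (x_hat, iterations_used, success_flag).
    """
    H = np.asarray(H, dtype=int)
    z = np.asarray(z, dtype=int)
    p1 = np.asarray(p1, dtype=float)
    p0 = 1.0 - p1
    M, N = H.shape
    checks_of = [np.flatnonzero(H[:, n]) for n in range(N)]   # the sets M(n)
    bits_of = [np.flatnonzero(H[m, :]) for m in range(M)]     # the sets N(m)

    # Initialization: q0_mn = p0_n and q1_mn = p1_n, stored as dq = q0 - q1.
    dq = H * (p0 - p1)[None, :]
    r0 = np.ones((M, N))
    r1 = np.ones((M, N))

    for it in range(1, max_iter + 1):
        # Horizontal step, using the difference identity and its inverse.
        for m in range(M):
            for n in bits_of[m]:
                others = bits_of[m][bits_of[m] != n]
                dr = (-1.0) ** z[m] * np.prod(dq[m, others])
                r0[m, n] = 0.5 * (1.0 + dr)
                r1[m, n] = 0.5 * (1.0 - dr)

        # Vertical step: new messages q_mn and pseudoposteriors q_n.
        q1_post = np.empty(N)
        for n in range(N):
            for m in checks_of[n]:
                others = checks_of[n][checks_of[n] != m]
                a0 = p0[n] * np.prod(r0[others, n])
                a1 = p1[n] * np.prod(r1[others, n])
                dq[m, n] = (a0 - a1) / (a0 + a1)    # normalized so q0 + q1 = 1
            a0 = p0[n] * np.prod(r0[checks_of[n], n])
            a1 = p1[n] * np.prod(r1[checks_of[n], n])
            q1_post[n] = a1 / (a0 + a1)

        # Tentative decoding; halt as soon as all checks are satisfied.
        x_hat = (q1_post > 0.5).astype(int)
        if np.array_equal((H @ x_hat) % 2, z):
            return x_hat, it, True
    return x_hat, max_iter, False

# Toy example on a cycle-free graph: two checks on three bits, one noise bit set.
H = np.array([[1, 1, 0],
              [0, 1, 1]])
noise = np.array([1, 0, 0])
z = (H @ noise) % 2
x_hat, n_iter, ok = sum_product_decode(H, z, p1=np.full(3, 0.1))
```

The toy graph has no cycles, so the algorithm is exact there and recovers the noise vector; on a real low-density parity-check matrix the same function applies unchanged, but its per-check cost is quadratic in the row weight.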


The stop-when-it’s-done decoding method. The recommended decoding
procedure is to set $\hat{x}_n$ to 1 if $q^1_n > 0.5$ and see if the checks $H\hat{x} = z \bmod 2$ are
all satisfied, halting when they are, and declaring a failure if some maximum
number of iterations (e.g. 200 or 1000) occurs without successful decoding. In
the event of a failure, we may still report $\hat{x}$, but we flag the whole block as a
failure.

We note in passing the difference between this decoding procedure and
the widespread practice in the turbo code community, where the decoding
algorithm is run for a fixed number of iterations (irrespective of whether the
decoder finds a consistent state at some earlier time). This practice is wasteful
of computer time, and it blurs the distinction between undetected and detected
errors. In our procedure, ‘undetected’ errors occur if the decoder finds an $\hat{x}$
satisfying $H\hat{x} = z \bmod 2$ that is not equal to the true x. ‘Detected’ errors
occur if the algorithm runs for the maximum number of iterations without
finding a valid decoding. Undetected errors are of scientific interest because
they reveal distance properties of a code. And in engineering practice, it would
seem preferable for the blocks that are known to contain detected errors to be
so labelled if practically possible.


Figure 47.3. Demonstration of encoding with a rate-1/2 Gallager code. The encoder is derived from
a very sparse 10000 x 20000 parity-check matrix with three 1s per column (figure 47.4).
(a) The code creates transmitted vectors consisting of 10000 source bits and 10000
parity-check bits. (b) Here, the source sequence has been altered by changing the first bit. Notice
that many of the parity-check bits are changed. Each parity bit depends on about half of
the source bits. (c) The transmission for the case s = (1,0,0,...,0). This vector is the
difference (modulo 2) between transmissions (a) and (b). [Dilbert image Copyright©1997
United Feature Syndicate, Inc., used with permission.]


Cost. In a brute-force approach, the time to create the generator matrix
scales as $N^3$, where N is the block size. The encoding time scales as $N^2$, but
encoding involves only binary arithmetic, so for the block lengths studied here
it takes considerably less time than the simulation of the Gaussian channel.
Decoding involves approximately 6Nj floating-point multiplies per iteration,
so the total number of operations per decoded bit (assuming 20 iterations)
is about 120j/R, independent of blocklength. For the codes presented in the
next section, this is about 800 operations.
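
As a check on the arithmetic, and assuming that ‘per decoded bit’ means per source bit (there are K = RN of them in a block), the count for codes with column weight j = 3 and rate R = 1/2 works out as

$$ \frac{6Nj \times 20}{K} = \frac{120\, N j}{R N} = \frac{120\, j}{R} = \frac{120 \times 3}{1/2} = 720 , $$

which is of the same order as the figure of about 800 quoted above.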

The encoding complexity can be reduced by clever encoding tricks invented 
by Richardson and Urbanke (2001b) or by specially constructing the parity- 
check matrix (MacKay et al., 1999). 

The decoding complexity can be reduced, with only a small loss in perfor- 
mance, by passing low-precision messages in place of real numbers (Richardson 
and Urbanke, 2001a). 


> 47.4 Pictorial demonstration of Gallager codes


Figures 47.3-47.7 illustrate visually the conditions under which low-density 
parity-check codes can give reliable communication over binary symmetric 
channels and Gaussian channels. These demonstrations may be viewed as 
animations on the world wide web.

Figure 47.4. A low-density parity-check matrix with N = 20000 columns of weight j = 3 and M =
10000 rows of weight k = 6.


Encoding 


Figure 47.3 illustrates the encoding operation for the case of a Gallager code 
whose parity-check matrix is a 10000 x 20000 matrix with three 1s per col- 
umn (figure 47.4). The high density of the generator matrix is illustrated in 
figure 47.3b and c by showing the change in the transmitted vector when one 
of the 10000 source bits is altered. Of course, the source images shown here 
are highly redundant, and such images should really be compressed before 
encoding. Redundant images are chosen in these demonstrations to make it 
easier to see the correction process during the iterative decoding. The decod- 
ing algorithm does not take advantage of the redundancy of the source vector, 
and it would work in exactly the same way irrespective of the choice of source 
vector. 


Iterative decoding 


The transmission is sent over a channel with noise level f = 7.5% and the 
received vector is shown in the upper left of figure 47.5. The subsequent 
pictures in figure 47.5 show the iterative probabilistic decoding process. The 
sequence of figures shows the best guess, bit by bit, given by the iterative 
decoder, after 0, 1, 2, 3, 10, 11, 12, and 13 iterations. The decoder halts after 
the 13th iteration when the best guess violates no parity checks. This final 
decoding is error free. 


In the case of an unusually noisy transmission, the decoding algorithm fails 
to find a valid decoding. For this code and a channel with f = 7.5%, such 
failures happen about once in every 100000 transmissions. Figure 47.6 shows 
this error rate compared with the block error rates of classical error-correcting 
codes.

Figure 47.5. Iterative probabilistic decoding of a low-density parity-check code for a transmission
received over a channel with noise level f = 7.5%. The sequence of figures shows the best
guess, bit by bit, given by the iterative decoder, after 0, 1, 2, 3, 10, 11, 12, and 13 iterations.
The decoder halts after the 13th iteration when the best guess violates no parity checks.
This final decoding is error free.

Figure 47.7. Demonstration of a Gallager code for a Gaussian channel. (a1) The received vector
after transmission over a Gaussian channel with x/σ = 1.185 ($E_b/N_0 = 1.47$ dB). The greyscale
represents the value of the normalized likelihood. This transmission can be perfectly
decoded by the sum–product decoder. The empirical probability of decoding failure is
about $10^{-5}$. (a2) The probability distribution of the output y of the channel with
x/σ = 1.185 for each of the two possible inputs. (b1) The received transmission over a
Gaussian channel with x/σ = 1.0, which corresponds to the Shannon limit. (b2) The probability
distribution of the output y of the channel with x/σ = 1.0 for each of the two possible inputs.


























Figure 47.8. Performance of rate-1/2 Gallager codes on the Gaussian channel.
Vertical axis: block error probability. Horizontal axis: signal-to-noise
ratio $E_b/N_0$. (a) Dependence on blocklength N for (j, k) = (3, 6) codes.
From left to right: N = 816, N = 408, N = 204, N = 96. The dashed lines show
the frequency of undetected errors, which is measurable only when the
blocklength is as small as N = 96 or N = 204. (b) Dependence on column
weight j for codes of blocklength N = 816.
In figure 47.7 the left picture shows the received vector after transmission over
a Gaussian channel with $x/\sigma = 1.185$. The greyscale represents the value
of the normalized likelihood, $P(y|t=1) / [P(y|t=1) + P(y|t=0)]$. This
signal-to-noise ratio $x/\sigma = 1.185$ is a noise level at which this rate-1/2
Gallager code communicates reliably (the probability of error is
$\simeq 10^{-5}$). To show how close we are to the Shannon limit, the right
panel shows the received vector when the signal-to-noise ratio is reduced to
$x/\sigma = 1.0$, which corresponds to the Shannon limit for codes of rate 1/2.


Variation of performance with code parameters 


Figure 47.8 shows how the parameters N and j affect the performance of 
low-density parity-check codes. As Shannon would predict, increasing the 
blocklength leads to improved performance. The dependence on j follows a 
different pattern. Given an optimal decoder, the best performance would be 
obtained for the codes closest to random codes, that is, the codes with largest 
j. However, the sum—product decoder makes poor progress in dense graphs, 
so the best performance is obtained for a small value of j. Among the values 
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Figure 47.9. Schematic illustration 
of constructions (a) of a 
completely regular Gallager code 
with j = 3, k = 6 and R = 1/2; 
(b) of a nearly-regular Gallager 
code with rate 1/3. Notation: an 
integer represents a number of 
permutation matrices superposed 
on the surrounding square. A 
diagonal line represents an 
identity matrix. 


Figure 47.10. Monte Carlo simulation of density evolution, following the decoding process for j = 4,
k = 8. Each curve shows the average entropy of a bit as a function of number of iterations,
as estimated by a Monte Carlo algorithm using 10000 samples per iteration. The noise
level of the binary symmetric channel f increases by steps of 0.005 from bottom graph
(f = 0.010) to top graph (f = 0.100). There is evidently a threshold at about f = 0.075,
above which the algorithm cannot determine x. From MacKay (1999b).


of j shown in the figure, j = 3 is the best, for a blocklength of 816, down to a
block error probability of $10^{-5}$.

This observation motivates construction of Gallager codes with some col- 
umns of weight 2. A construction with M/2 columns of weight 2 is shown in 
figure 47.9b. Too many columns of weight 2, and the code becomes a much 
poorer code. 

As we'll discuss later, we can do even better by making the code even more 
irregular. 


47.5 Density evolution 


One way to study the decoding algorithm is to imagine it running on an infinite 
tree-like graph with the same local topology as the Gallager code’s graph. 
The larger the matrix H, the closer its decoding properties should approach 
those of the infinite graph. 

Imagine an infinite belief network with no loops, in which every bit $x_n$
connects to j checks and every check $z_m$ connects to k bits (figure 47.11).
We consider the iterative flow of information in this network, and examine 
the average entropy of one bit as a function of number of iterations. At each 
iteration, a bit has accumulated information from its local network out to a 
radius equal to the number of iterations. Successful decoding will occur only 
if the average entropy of a bit decreases to zero as the number of iterations 
increases. 

The iterations of an infinite belief network can be simulated by Monte
Carlo methods, a technique first used by Gallager (1963). Imagine a network
of radius I (the total number of iterations) centred on one bit. Our aim is
to compute the conditional entropy of the central bit x given the state z of
all checks out to radius I. To evaluate the probability that the central bit
is 1 given a particular syndrome z involves an I-step propagation from the
outside of the network into the centre. At the ith iteration, probabilities r at





Figure 47.11. Local topology of 
the graph of a Gallager code with 
column weight j = 3 and row 
weight k = 4. White nodes 
represent bits, $x_l$; black nodes
represent checks, $z_m$; each edge
corresponds to a 1 in H. 




radius I − i + 1 are transformed into qs and then into rs at radius I − i in
a way that depends on the states x of the unknown bits at radius I − i. In
the Monte Carlo method, rather than simulating this network exactly, which
would take a time that grows exponentially with I, we create for each iteration
a representative sample (of size 100, say) of the values of {r, x}. In the case
of a regular network with parameters j, k, each new pair {r, x} in the list at
the ith iteration is created by drawing the new x from its distribution and
drawing at random with replacement (j − 1)(k − 1) pairs {r, x} from the list at
the (i − 1)th iteration; these are assembled into a tree fragment (figure 47.12)
and the sum-product algorithm is run from top to bottom to find the new r
value associated with the new node.

As an example, the results of runs with j = 4, k = 8 and noise densities f
between 0.01 and 0.10, using 10000 samples at each iteration, are shown in 
figure 47.10. Runs with low enough noise level show a collapse to zero entropy 
after a small number of iterations, and those with high noise level decrease to 
a non-zero entropy corresponding to a failure to decode. 
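
The following is a minimal sketch of such a Monte Carlo run, not the procedure used to produce figure 47.10. It swaps in a common simplification: instead of carrying pairs {r, x} and a syndrome, it assumes (by the symmetry of the binary symmetric channel) that the all-zero codeword was sent, and represents each message by a log-likelihood ratio. The function names, the sample size and the clipping constant are all assumptions made for illustration.

```python
import math
import random

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def channel_llr(f, rng):
    """LLR ln[P(y|x=0)/P(y|x=1)] from a BSC with flip probability f, given 0 was sent."""
    magnitude = math.log((1 - f) / f)
    return -magnitude if rng.random() < f else magnitude

def check_message(pop, k, rng):
    """Check-to-bit LLR built from k-1 bit-to-check LLRs drawn with replacement."""
    prod = 1.0
    for _ in range(k - 1):
        prod *= math.tanh(rng.choice(pop) / 2)
    prod = max(min(prod, 1 - 1e-12), -(1 - 1e-12))  # keep atanh finite
    return 2 * math.atanh(prod)

def density_evolution(f, j, k, iterations, samples, seed=0):
    """Return the estimated average entropy of a bit after each iteration."""
    rng = random.Random(seed)
    pop = [channel_llr(f, rng) for _ in range(samples)]  # iteration-0 messages
    entropies = []
    for _ in range(iterations):
        # The posterior of a bit combines the channel with all j checks.
        total = 0.0
        for _ in range(samples):
            llr = channel_llr(f, rng) + sum(check_message(pop, k, rng) for _ in range(j))
            total += h2(1 / (1 + math.exp(llr)))  # P(x=1) = 1/(1+e^llr)
        entropies.append(total / samples)
        # An outgoing bit-to-check message combines the channel with j-1 checks,
        # so each new sample consumes (j-1)(k-1) samples from the previous list.
        pop = [channel_llr(f, rng) + sum(check_message(pop, k, rng) for _ in range(j - 1))
               for _ in range(samples)]
    return entropies
```

For j = 4, k = 8 a call such as `density_evolution(0.02, 4, 8, 20, 500)` should show the entropy collapsing towards zero, whereas a noise level well above the threshold, such as f = 0.12, should leave it at a clearly non-zero value.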

The boundary between these two behaviours is called the threshold of the 
decoding algorithm for the binary symmetric channel. Figure 47.10 shows by 
Monte Carlo simulation that the threshold for regular (j,k) = (4,8) codes 
is about 0.075. Richardson and Urbanke (2001a) have derived thresholds for 
regular codes by a tour de force of direct analytic methods. Some of these 
thresholds are shown in table 47.13. 
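
To put these thresholds in context, table 47.13 quotes a Shannon limit of about f = 0.11 for rate-1/2 codes on the binary symmetric channel. The sketch below recovers that number by solving 1 − H_2(f) = R with bisection; the function names and tolerance are assumptions chosen for illustration.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_shannon_limit(rate, tol=1e-9):
    """Largest flip probability f in [0, 1/2] at which the BSC capacity 1 - H2(f) still reaches `rate`."""
    lo, hi = 0.0, 0.5  # capacity falls monotonically from 1 to 0 on this interval
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if 1 - h2(mid) > rate:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Calling `bsc_shannon_limit(0.5)` gives a value close to 0.110, so the regular (3,6) threshold of 0.084 still leaves a visible gap to the limit.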


Approximate density evolution 


For practical purposes, the computational cost of density evolution can be 
reduced by making Gaussian approximations to the probability distributions 
over the messages in density evolution, and updating only the parameters of 
these approximations. For further information about these techniques, which 
produce diagrams known as EXIT charts, see (ten Brink, 1999; Chung et al., 
2001; ten Brink et al., 2002). 


47.6 Improving Gallager codes 


Since the rediscovery of Gallager codes, two methods have been found for 
enhancing their performance. 


Clump bits and checks together 


First, we can make Gallager codes in which the variable nodes are grouped 
together into metavariables consisting of say 3 binary variables, and the check 
nodes are similarly grouped together into metachecks. As before, a sparse 
graph can be constructed connecting metavariables to metachecks, with a lot 
of freedom about the details of how the variables and checks within are wired 
up. One way to set the wiring is to work in a finite field GF(q) such as GF(4)
or GF(8), define low-density parity-check matrices using elements of GF(q),
and translate our binary messages into GF(q) using a mapping such as the 
one for GF(4) given in table 47.14. Now, when messages are passed during 
decoding, those messages are probabilities and likelihoods over conjunctions 
of binary variables. For example if each clump contains three binary variables 
then the likelihoods will describe the likelihoods of the eight alternative states 
of those bits. 
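
As a small illustration of the translation step, the sketch below maps pairs of bits to the GF(4) symbols 0, 1, A, B using the correspondence of table 47.14, and derives the 2 × 2 binary matrix that reproduces multiplication by each field element. The function names, and the choice to store each symbol as a 2-bit integer, are assumptions made for this example.

```python
SYMBOLS = "01AB"  # index is the 2-bit value: 0 = 00, 1 = 01, A = 10, B = 11

def bits_to_gf4(bits):
    """Translate a binary string of even length into a string of GF(4) symbols."""
    return "".join(SYMBOLS[int(bits[i:i + 2], 2)] for i in range(0, len(bits), 2))

def gf4_add(a, b):
    """Addition in GF(4) is bitwise XOR of the 2-bit representations."""
    return a ^ b

def gf4_mul(a, b):
    """Multiply as polynomials over GF(2), then reduce modulo x^2 + x + 1."""
    product = 0
    for i in range(2):
        if (b >> i) & 1:
            product ^= a << i
    if product & 0b100:
        product ^= 0b111
    return product

def mul_matrix(c):
    """Binary 2x2 matrix whose action on a (first bit, second bit) column vector is multiplication by c."""
    image_of_A = gf4_mul(c, 0b10)
    image_of_1 = gf4_mul(c, 0b01)
    return [[image_of_A >> 1, image_of_1 >> 1],
            [image_of_A & 1, image_of_1 & 1]]
```

For example, `bits_to_gf4("0110")` gives `"1A"`, and `mul_matrix(2)` (multiplication by A) gives `[[1, 1], [1, 0]]`. Expanding every entry of an M × N matrix over GF(4) in this way yields a 2M × 2N binary matrix.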

With carefully optimized constructions, the resulting codes over GF (4), 
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Figure 47.12. A tree-fragment 
constructed during Monte Carlo 
simulation of density evolution. 
This fragment is appropriate for a 
regular j =3, k=4 Gallager code. 


(j,k) fmax 
(3,6) 0.084 
(4,8) 0.076 
(5,10) 0.068 


Table 47.13. Thresholds fmax for 
regular low-density parity-check 
codes, assuming sum—product 
decoding algorithm, from 
Richardson and Urbanke (2001a). 
The Shannon limit for rate-1/2 
codes is fmax = 0.11. 


GF(4) ↔ binary

  0  ↔  00
  1  ↔  01
  A  ↔  10
  B  ↔  11

Table 47.14. Translation between 
GF(4) and binary for message 
symbols. 


GF(4) → binary

  0  →  00
        00

  1  →  10
        01

  A  →  11
        10

  B  →  01
        11


Table 47.15. Translation between 
GF(4) and binary for matrix 
entries. An M x N parity-check 
matrix over GF'(4) can be turned 
into a 2M x 2N binary 
parity-check matrix in this way. 
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Algorithm 47.16. The Fourier 
transform over GF(4). 

The Fourier transform F of a
function f over GF(2) is given by
$F^0 = f^0 + f^1$, $F^1 = f^0 - f^1$.
Transforms over GF($2^k$) can be
viewed as a sequence of binary
transforms in each of k
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Figure 47.17. Comparison of regular binary Gallager codes with irregular codes, codes over GF (q), 
and other outstanding codes of rate 1/4. From left (best performance) to right: Irregular 
low-density parity-check code over GF(8), blocklength 48000 bits (Davey, 1999); JPL 
turbo code (JPL, 1996) blocklength 65 536; Regular low-density parity-check over GF(16),
blocklength 24448 bits (Davey and MacKay, 1998); Irregular binary low-density parity-
check code, blocklength 16 000 bits (Davey, 1999); Luby et al. (1998) irregular binary low-
density parity-check code, blocklength 64000 bits; JPL code for Galileo (in 1992, this was
the best known code of rate 1/4); Regular binary low-density parity-check code: blocklength
40000 bits (MacKay, 1999b). The Shannon limit is at about −0.79 dB. As of 2003, even
better sparse-graph codes have been constructed. 


GF(8), and GF(16) perform nearly one decibel better than comparable binary
Gallager codes.

The computational cost for decoding in GF(q) scales as q log q, if the
appropriate Fourier transform is used in the check nodes: the update rule for
the check-to-variable message,

$$ r^a_{mn} = \sum_{\mathbf{x}:\, x_n = a} \mathbb{1}\Big[ \sum_{n' \in \mathcal{N}(m)} H_{mn'} x_{n'} = z_m \Big] \prod_{j \in \mathcal{N}(m) \setminus n} q^{x_j}_{mj}, \tag{47.15} $$

is a convolution of the quantities $q^a_{mj}$, so the summation can be replaced by
a product of the Fourier transforms of $q^a_{mj}$ for $j \in \mathcal{N}(m) \setminus n$, followed by
an inverse Fourier transform. The Fourier transform for GF(4) is shown in
algorithm 47.16.
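
The convolution property can be checked numerically. The sketch below is an illustration only: it assumes that all the relevant entries of H equal 1 (so that no permutation of the message vectors is needed), treats GF(4) addition as bitwise XOR of 2-bit labels, and uses function names of our own choosing. It compares a direct convolution of two message vectors with the product of their transforms followed by the inverse transform.

```python
def fourier(f):
    """Transform over GF(2^k): apply the binary (sum, difference) transform along each of the k bit dimensions."""
    F = list(f)
    half = 1
    while half < len(F):
        for start in range(0, len(F), 2 * half):
            for i in range(start, start + half):
                F[i], F[i + half] = F[i] + F[i + half], F[i] - F[i + half]
        half *= 2
    return F

def inverse_fourier(F):
    """The inverse is the same transform followed by division by the vector length."""
    n = len(F)
    return [value / n for value in fourier(F)]

def convolve_direct(q1, q2):
    """r[a] = sum over x1 + x2 = a of q1[x1] q2[x2], with field addition as XOR."""
    n = len(q1)
    return [sum(q1[x] * q2[a ^ x] for x in range(n)) for a in range(n)]

def convolve_fourier(q1, q2):
    """Same convolution computed as a pointwise product in the transform domain."""
    product = [u * v for u, v in zip(fourier(q1), fourier(q2))]
    return inverse_fourier(product)
```

With more than two incoming messages the pointwise product simply gains more factors, which is where the saving over the direct sum comes from.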


Make the graph irregular 


The second way of improving Gallager codes, introduced by Luby et al. (2001b), 
is to make their graphs irregular. Instead of giving all variable nodes the same 
degree j, we can have some variable nodes with degree 2, some 3, some 4, and 
a few with degree 20. Check nodes can also be given unequal degrees — this 
helps improve performance on erasure channels, but it turns out that for the 
Gaussian channel, the best graphs have regular check degrees. 

Figure 47.17 illustrates the benefits offered by these two methods for
improving Gallager codes, focussing on codes of rate 1/4. Making the binary code
irregular gives a win of about 0.4 dB; switching from GF(2) to GF(16) gives
about 0.6 dB; and Matthew Davey’s code that combines both these features
(it’s irregular over GF(8)) gives a win of about 0.9 dB over the regular binary
Gallager code.

Methods for optimizing the profile of a Gallager code (that is, its number of 
rows and columns of each degree), have been developed by Richardson et al. 
(2001) and have led to low-density parity-check codes whose performance, 
when decoded by the sum—product algorithm, is within a hair’s breadth of the 
Shannon limit. 


Algebraic constructions of Gallager codes 


The performance of regular Gallager codes can be enhanced in a third man- 
ner: by designing the code to have redundant sparse constraints. There is a 
difference-set cyclic code, for example, that has N = 273 and K = 191, but 
the code satisfies not M = 82 but N, i.e., 273 low-weight constraints (figure 
47.18). It is impossible to make random Gallager codes that have anywhere 
near this much redundancy among their checks. The difference-set cyclic code 
performs about 0.7dB better than an equivalent random Gallager code. 

An open problem is to discover codes sharing the remarkable properties of 
the difference-set cyclic codes but with different blocklengths and rates. I call 
this task the Tanner challenge. 


47.7 Fast encoding of low-density parity-check codes 


We now discuss methods for fast encoding of low-density parity-check codes,
faster than the standard method, in which a generator matrix G is found by
Gaussian elimination (at a cost of order $M^3$) and then each block is encoded
by multiplying it by G (at a cost of order $M^2$).


Staircase codes 


Certain low-density parity-check matrices with M columns of weight 2 or less 
can be encoded easily in linear time. For example, if the matrix has a staircase 
structure as illustrated by the right-hand side of 


(47.16) 
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Figure 47.18. An algebraically 
constructed low-density 
parity-check code satisfying many 
redundant constraints 
outperforms an equivalent random 
Gallager code. The table shows 
the N, M, K, distance d, and row 
weight k of some difference-set 
cyclic codes, highlighting the 
codes that have large d/N, small 
k, and large N/M. In the 
comparison the Gallager code had 
(j, k) = (4,13), and rate identical 
to the N = 273 difference-set 
cyclic code. Vertical axis: block 
error probability. Horizontal axis: 
signal-to-noise ratio $E_b/N_0$ (dB).
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and if the data s are loaded into the first K bits, then the M parity bits p 
can be computed from left to right in linear time. 


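$$
\begin{aligned}
p_1 &= \sum_{n=1}^{K} H_{1n} s_n \\
p_2 &= p_1 + \sum_{n=1}^{K} H_{2n} s_n \\
p_3 &= p_2 + \sum_{n=1}^{K} H_{3n} s_n \\
    &\;\;\vdots \\
p_M &= p_{M-1} + \sum_{n=1}^{K} H_{Mn} s_n .
\end{aligned}
\tag{47.17}
$$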





If we call two parts of the H matrix $[H_s | H_p]$, we can describe the encoding
operation in two steps: first compute an intermediate parity vector $v = H_s s$;
then pass v through an accumulator to create p.

The cost of this encoding method is linear if the sparsity of H is exploited 
when computing the sums in (47.17). 
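
A minimal sketch of this two-step encoder follows. It assumes the source part of H is stored sparsely, as a list giving, for each row m, the column indices n at which H_mn = 1; that representation and the function names are assumptions made for illustration.

```python
def staircase_encode(source_rows, s):
    """Encode source bits s; source_rows[m] lists the columns n with H_mn = 1 in the source part of H."""
    parity = []
    accumulator = 0
    for columns in source_rows:
        v = 0
        for n in columns:          # sparse sum: one XOR per 1 in the row
            v ^= s[n]
        accumulator ^= v           # p_m = p_{m-1} + v_m (mod 2)
        parity.append(accumulator)
    return list(s) + parity

def staircase_syndrome(source_rows, t):
    """Evaluate every parity check of the full staircase matrix on the vector t."""
    K = len(t) - len(source_rows)
    syndrome = []
    for m, columns in enumerate(source_rows):
        z = t[K + m]               # diagonal of the staircase
        if m > 0:
            z ^= t[K + m - 1]      # subdiagonal of the staircase
        for n in columns:
            z ^= t[n]
        syndrome.append(z)
    return syndrome
```

For instance, with `source_rows = [[0, 1], [1, 2, 3], [0, 3]]` and `s = [1, 0, 1, 1]`, the encoder returns `[1, 0, 1, 1, 1, 1, 1]`, and all three checks are satisfied. The work done is proportional to the number of 1s in H, which is linear in N for a low-density matrix.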


Fast encoding of general low-density parity-check codes 


Richardson and Urbanke (2001b) demonstrated an elegant method by which
the encoding cost of any low-density parity-check code can be reduced from
the straightforward method’s $M^2$ to a cost of $N + g^2$, where g, the gap, is
hopefully a small constant, and in the worst cases scales as a small fraction of
N.


Figure 47.19. The parity-check 
matrix in approximate 
lower-triangular form. 














In the first step, the parity-check matrix is rearranged, by row-interchange 
and column-interchange, into the approximate lower-triangular form shown in 
figure 47.19. The original matrix H was very sparse, so the six matrices A, 
B, T, C, D, and E are also very sparse. The matrix T is lower triangular and 
has 1s everywhere on the diagonal. 


$$ H = \begin{bmatrix} A & B & T \\ C & D & E \end{bmatrix} \tag{47.18} $$


The source vector s of length K = N — M is encoded into a transmission 
t = [s, p1, p2] as follows. 


1. Compute the upper syndrome of the source vector,
   $$ z_A = A s. \tag{47.19} $$
   This can be done in linear time.


2. Find a setting of the second parity bits, $p_2^A$, such that the upper
   syndrome is zero.
   $$ p_2^A = -T^{-1} z_A. \tag{47.20} $$

   This vector can be found in linear time by back-substitution, i.e.,
   computing the first bit of $p_2^A$, then the second, then the third, and so forth.

3. Compute the lower syndrome of the vector $[s, 0, p_2^A]$:
   $$ z_B = C s - E p_2^A. \tag{47.21} $$
   This can be done in linear time.

4. Now we get to the clever bit. Define the matrix
   $$ F \equiv -E T^{-1} B + D, \tag{47.22} $$

   and find its inverse, $F^{-1}$. This computation needs to be done once only,
   and its cost is of order $g^3$. This inverse $F^{-1}$ is a dense g × g matrix. [If F
   is not invertible then either H is not of full rank, or else further column
   permutations of H can produce an F that is invertible.]

   Set the first parity bits, $p_1$, to
   $$ p_1 = -F^{-1} z_B. \tag{47.23} $$

   This operation has a cost of order $g^2$.

   Claim: At this point, we have found the correct setting of the first parity
   bits, $p_1$.


5. Discard the tentative parity bits $p_2^A$ and find the new upper syndrome,
   $$ z_C = z_A + B p_1. \tag{47.24} $$
   This can be done in linear time.


6. Find a setting of the second parity bits, $p_2$, such that the upper syndrome
   is zero,
   $$ p_2 = -T^{-1} z_C. \tag{47.25} $$

   This vector can be found in linear time by back-substitution.
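
The six steps can be sketched in code as follows. This is an illustration under stated assumptions rather than an efficient implementation: matrices are dense lists of 0/1 rows (so the linear-time claims are not realized), arithmetic is modulo 2 (so the minus signs disappear), F and its inverse are recomputed on every call instead of once, and all function names are our own.

```python
def matvec(M, x):
    """Matrix-vector product over GF(2)."""
    return [sum(a & b for a, b in zip(row, x)) % 2 for row in M]

def add(u, v):
    """Vector addition over GF(2)."""
    return [a ^ b for a, b in zip(u, v)]

def solve_lower(T, b):
    """Solve T x = b over GF(2), T lower triangular with 1s on the diagonal."""
    x = []
    for i, row in enumerate(T):
        x.append((b[i] + sum(row[c] & x[c] for c in range(i))) % 2)
    return x

def gf2_inverse(F):
    """Inverse of a square binary matrix by Gauss-Jordan elimination, or None if singular."""
    g = len(F)
    aug = [list(row) + [int(r == c) for c in range(g)] for r, row in enumerate(F)]
    for col in range(g):
        pivot = next((r for r in range(col, g) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(g):
            if r != col and aug[r][col]:
                aug[r] = [a ^ b for a, b in zip(aug[r], aug[col])]
    return [row[g:] for row in aug]

def encode(A, B, T, C, D, E, s):
    """Return t = [s, p1, p2] satisfying H t = 0 for H = [[A, B, T], [C, D, E]]."""
    g = len(D)
    z_A = matvec(A, s)                                   # step 1
    p2_A = solve_lower(T, z_A)                           # step 2
    z_B = add(matvec(C, s), matvec(E, p2_A))             # step 3
    # Step 4: build F = E T^{-1} B + D one column at a time, then invert it.
    columns = [add(matvec(E, solve_lower(T, [row[c] for row in B])),
                   [row[c] for row in D]) for c in range(g)]
    F = [[columns[c][r] for c in range(g)] for r in range(g)]
    F_inv = gf2_inverse(F)
    if F_inv is None:
        raise ValueError("F is singular: permute columns of H or check its rank")
    p1 = matvec(F_inv, z_B)
    z_C = add(z_A, matvec(B, p1))                        # step 5
    p2 = solve_lower(T, z_C)                             # step 6
    return list(s) + p1 + p2
```

A quick way to gain confidence in such a sketch is to assemble the full H from its six blocks and confirm that H t = 0 for every source vector of a small example.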


> 47.8 Further reading 


Low-density parity-check codes were first studied in 1962 by Gallager,
then were generally forgotten by the coding theory community. Tanner (1981) 
generalized Gallager’s work by introducing more general constraint nodes; the 
codes that are now called turbo product codes should in fact be called Tanner 
product codes, since Tanner proposed them, and his colleagues (Karplus and 
Krit, 1991) implemented them in hardware. Publications on Gallager codes 
contributing to their 1990s rebirth include (Wiberg et al., 1995; MacKay and 
Neal, 1995; MacKay and Neal, 1996; Wiberg, 1996; MacKay, 1999b; Spielman, 
1996; Sipser and Spielman, 1996). Low-precision decoding algorithms and fast 
encoding algorithms for Gallager codes are discussed in (Richardson and
Urbanke, 2001a; Richardson and Urbanke, 2001b). MacKay and Davey (2000)
showed that low-density parity-check codes can outperform Reed-Solomon 
codes, even on the Reed-Solomon codes’ home turf: high rate and short block- 
lengths. Other important papers include (Luby et al., 2001a; Luby et al., 
2001b; Luby et al., 1997; Davey and MacKay, 1998; Richardson et al., 2001; 
Chung et al., 2001). Useful tools for the design of irregular low-density parity- 
check codes include (Chung et al., 1999; Urbanke, 2001). 

See (Wiberg, 1996; Frey, 1998; McEliece et al., 1998) for further discussion 
of the sum-product algorithm. 

For a view of low-density parity-check code decoding in terms of group 
theory and coding theory, see (Forney, 2001; Offer and Soljanin, 2000; Offer 
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and Soljanin, 2001); and for background reading on this topic see (Hartmann 
and Rudolph, 1976; Terras, 1999). There is a growing literature on the prac- 
tical design of low-density parity-check codes (Mao and Banihashemi, 2000; 
Mao and Banihashemi, 2001; ten Brink et al., 2002); they are now being 
adopted for applications from hard drives to satellite communications. 

For low-density parity-check codes applicable to quantum error-correction, 
see MacKay et al. (2004). 


> 47.9 Exercises


Exercise 47.1.!7] The ‘hyperbolic tangent’ version of the decoding algorithm. In 
section 47.3, the sum-product decoding algorithm for low-density
parity-check codes was presented first in terms of quantities $q^{0/1}_{mn}$ and
$r^{0/1}_{mn}$, then in terms of quantities $\delta q$ and $\delta r$. There is a third
description, in which the {q} are replaced by log probability-ratios,

$$ l_{mn} \equiv \ln \frac{q^0_{mn}}{q^1_{mn}}. \tag{47.26} $$

Show that

$$ \delta q_{mn} \equiv q^0_{mn} - q^1_{mn} = \tanh(l_{mn}/2). \tag{47.27} $$


Derive the update rules for {r} and {l}. 


Exercise 47.2.[% p.572] I am sometimes asked ‘why not decode other linear
codes, for example algebraic codes, by transforming their parity-check 
matrices so that they are low-density, and applying the sum—product 
algorithm?’ [Recall that any linear combination of rows of H, H’ = PH, 
is a valid parity-check matrix for a code, as long as the matrix P is 
invertible; so there are many parity check matrices for any one code.] 


Explain why a random linear code does not have a low-density parity- 
check matrix. [Here, low-density means ‘having row-weight at most k’, 
where k is some small constant ≪ N.]


Exercise 47.3.!9] Show that if a low-density parity-check code has more than 
M columns of weight 2 (say αM columns, where α > 1) then the code
will have words with weight of order log M. 


Exercise 47.4.1] In section 13.5 we found the expected value of the weight 
enumerator function A(w), averaging over the ensemble of all random 
linear codes. This calculation can also be carried out for the ensemble of 
low-density parity-check codes (Gallager, 1963; MacKay, 1999b; Litsyn 
and Shevelev, 2002). It is plausible, however, that the mean value of 
A(w) is not always a good indicator of the typical value of A(w) in the 
ensemble. For example, if, at a particular value of w, 99% of codes have 
A(w) = 0, and 1% have A(w) = 100000, then while we might say the 
typical value of A(w) is zero, the mean is found to be 1000. Find the 
typical weight enumerator function of low-density parity-check codes. 


> 47.10 Solutions 


Solution to exercise 47.2 (p.572). Consider codes of rate R and blocklength 
N, having K = RN source bits and M = (1—R)N parity-check bits. Let all 
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the codes have their bits ordered so that the first K bits are independent, so 
that we could if we wish put the code in systematic form, 


$$ G = [I_K | P^{\mathsf{T}}]; \quad H = [P | I_M]. \tag{47.28} $$


The number of distinct linear codes is the number of matrices P, which is
$N_1 = 2^{MK} = 2^{N^2 R(1-R)}$, so that $\log N_1 \simeq N^2 R(1-R)$. Can these
all be expressed as distinct low-density parity-check codes?

The number of low-density parity-check matrices with row-weight k is

$$ \binom{N}{k}^{M} \tag{47.29} $$

and the number of distinct codes that they define is at most

$$ N_2 = \binom{N}{k}^{M} \Big/ M!, \tag{47.30} $$

which is much smaller than $N_1$ (since $\log N_2 < N k \log N$), so, by the
pigeon-hole principle, it is not possible for every random linear code to map
on to a low-density H.
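
To get a feel for the sizes involved, the sketch below evaluates both counts in bits for one illustrative choice of parameters; the values N = 1000, R = 1/2 and k = 6 and the function names are assumptions picked for the example.

```python
import math

def log2_linear_codes(N, R):
    """log2 of N_1 = 2^(MK), the number of systematic binary linear codes."""
    K = round(R * N)
    M = N - K
    return M * K

def log2_low_density_codes(N, R, k):
    """log2 of the upper bound N_2 = C(N, k)^M / M! on distinct low-density codes."""
    M = N - round(R * N)
    log2_factorial = math.lgamma(M + 1) / math.log(2)
    return M * math.log2(math.comb(N, k)) - log2_factorial
```

With these parameters `log2_linear_codes(1000, 0.5)` is 250000, whereas `log2_low_density_codes(1000, 0.5, 6)` comes out below the bound N k log2 N (about 60000), so only a vanishing fraction of linear codes can have a row-weight-6 parity-check matrix.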
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Convolutional Codes and Turbo Codes 


This chapter follows tightly on from Chapter 25. It makes use of the ideas of 
codes and trellises and the forward—backward algorithm. 


> 48.1 Introduction to convolutional codes 
When we studied linear block codes, we described them in three ways: 


1. The generator matrix describes how to turn a string of K arbitrary 
source bits into a transmission of N bits. 


2. The parity-check matrix specifies the M = N — K parity-check con- 
straints that a valid codeword satisfies. 


3. The trellis of the code describes its valid codewords in terms of paths 
through a trellis with labelled edges. 


A fourth way of describing some block codes, the algebraic approach, is not 
covered in this book (a) because it has been well covered by numerous other 
books in coding theory; (b) because, as this part of the book discusses, the 
state-of-the-art in error-correcting codes makes little use of algebraic coding 
theory; and (c) because I am not competent to teach this subject. 

We will now describe convolutional codes in two ways: first, in terms of 
mechanisms for generating transmissions t from source bits s; and second, in 
terms of trellises that describe the constraints satisfied by valid transmissions. 


> 48.2 Linear-feedback shift-registers


We generate a transmission with a convolutional code by putting a source 
stream through a linear filter. This filter makes use of a shift register, linear 
output functions, and, possibly, linear feedback. 

I will draw the shift-register in a right-to-left orientation: bits roll from 
right to left as time goes on. 

Figure 48.1 shows three linear-feedback shift-registers which could be used
to define convolutional codes. The rectangular box surrounding the bits
$z_1 \ldots z_7$ indicates the memory of the filter, also known as its state. All three
filters have one input and two outputs. On each clock cycle, the source
supplies one bit, and the filter outputs two bits $t^{(a)}$ and $t^{(b)}$. By concatenating
together these bits we can obtain from our source stream $s_1 s_2 s_3 \ldots$ a
transmission stream $t^{(a)}_1 t^{(b)}_1 t^{(a)}_2 t^{(b)}_2 t^{(a)}_3 t^{(b)}_3 \ldots$. Because there are two transmitted bits
for every source bit, the codes shown in figure 48.1 have rate 1/2. Because
Figure 48.1. Linear-feedback shift-registers for generating convolutional
codes with rate 1/2. The symbol [D] indicates a copying with a delay of one
clock cycle. The symbol ⊕ denotes linear addition modulo 2 with no delay.
The filters are (a) systematic and nonrecursive; (b) nonsystematic and
nonrecursive; (c) systematic and recursive. Octal names: (a) $(1, 353)_8$;
(b) $(247, 371)_8$; (c) $(1, 247/371)_8$.




these filters require k = 7 bits of memory, the codes they define are known as
constraint-length 7 codes.

Convolutional codes come in three flavours, corresponding to the three 
types of filter in figure 48.1. 


Systematic nonrecursive 


The filter shown in figure 48.1a has no feedback. It also has the property that
one of the output bits, $t^{(a)}$, is identical to the source bit s. This encoder is
thus called systematic, because the source bits are reproduced transparently
in the transmitted stream, and nonrecursive, because it has no feedback. The
other transmitted bit $t^{(b)}$ is a linear function of the state of the filter. One
way of describing that function is as a dot product (modulo 2) between two
binary vectors of length k + 1: a binary vector $g^{(b)} = (1,1,1,0,1,0,1,1)$ and
the state vector $z = (z_k, z_{k-1}, \ldots, z_1, z_0)$. We include in the state vector the
bit $z_0$ that will be put into the first bit of the memory on the next cycle. The
vector $g^{(b)}$ has $g^{(b)}_\kappa = 1$ for every $\kappa$ where there is a tap (a downward pointing
arrow) from state bit $z_\kappa$ into the transmitted bit $t^{(b)}$.

A convenient way to describe these binary tap vectors is in octal. Thus,
this filter makes use of the tap vector $353_8$. I have drawn the delay lines from
right to left to make it easy to relate the diagrams to these octal numbers.

  binary   11   101   011
  octal     3     5     3

Table 48.2. How taps in the delay line are converted to octal.


Nonsystematic nonrecursive 


The filter shown in figure 48.1b also has no feedback, but it is not systematic. 
It makes use of two tap vectors $g^{(a)}$ and $g^{(b)}$ to create its two transmitted bits.
This encoder is thus nonsystematic and nonrecursive. Because of their added 
complexity, nonsystematic codes can have error-correcting abilities superior to 
those of systematic nonrecursive codes with the same constraint length. 
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Systematic recursive 


The filter shown in figure 48.1c is similar to the nonsystematic nonrecursive
filter shown in figure 48.1b, but it uses the taps that formerly made up $g^{(a)}$
to make a linear signal that is fed back into the shift register along with the
source bit. The output $t^{(b)}$ is a linear function of the state vector as before.
The other output is $t^{(a)} = s$, so this filter is systematic.

A recursive code is conventionally identified by an octal ratio, e.g.,
figure 48.1c’s code is denoted by $(247/371)_8$.


Equivalence of systematic recursive and nonsystematic nonrecursive codes


The two filters in figure 48.1b,c are code-equivalent in that the sets of
codewords that they define are identical. For every codeword of the nonsystematic
nonrecursive code we can choose a source stream for the other encoder such
that its output is identical (and vice versa).

To prove this, we denote by p the quantity $\sum_{\kappa=1}^{k} g^{(a)}_\kappa z_\kappa$, as shown in
figure 48.3a and b, which shows a pair of smaller but otherwise equivalent filters.
If the two transmissions are to be equivalent (that is, the $t^{(a)}$s are equal in
both figures and so are the $t^{(b)}$s) then on every cycle the source bit in the
systematic code must be $s = t^{(a)}$. So now we must simply confirm that for
this choice of s, the systematic code’s shift register will follow the same state
sequence as that of the nonsystematic code, assuming that the states match
initially. In figure 48.3a we have

$$ t^{(a)} = p \oplus z_0^{\text{nonrecursive}}, \tag{48.1} $$

whereas in figure 48.3b we have

$$ z_0^{\text{recursive}} = t^{(a)} \oplus p. \tag{48.2} $$

Substituting for $t^{(a)}$, and using $p \oplus p = 0$ we immediately find

$$ z_0^{\text{recursive}} = z_0^{\text{nonrecursive}}. \tag{48.3} $$

Thus, any codeword of a nonsystematic nonrecursive code is a codeword of
a systematic recursive code with the same taps; the same taps in the sense
that there are vertical arrows in all the same places in figures 48.3(a) and (b),
though one of the arrows points up instead of down in (b).

Figure 48.3. Two rate-1/2 convolutional codes with constraint length k = 2:
(a) non-recursive, $(5, 7)_8$; (b) recursive, $(1, 5/7)_8$. The two codes are
equivalent.

Now, while these two codes are equivalent, the two encoders behave dif- 
ferently. The nonrecursive encoder has a finite impulse response, that is, if 
one puts in a string that is all zeroes except for a single one, the resulting 
output stream contains a finite number of ones. Once the one bit has passed 
through all the states of the memory, the delay line returns to the all-zero 
state. Figure 48.4a shows the state sequence resulting from the source string 
s =(0, 0, 1, 0, 0, 0, 0, 0). 

Figure 48.4b shows the trellis of the recursive code of figure 48.3b and the 
response of this filter to the same source string s =(0, 0, 1, 0, 0, 0, 0, 0). The 
filter has an infinite impulse response. The response settles into a periodic 
state with period equal to three clock cycles. 
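
The behaviour of the two k = 2 filters can be reproduced with a few lines of code. In this sketch a tap vector is an integer whose bit κ is $g_\kappa$, so the octal names can be typed directly as `0o7` and `0o5`. One assumption deserves emphasis: we take $t^{(a)}$ (and the feedback) to use taps 7 and $t^{(b)}$ to use taps 5, an assignment inferred from the transmitted strings printed under the trellises of figures 48.4 and 48.5 rather than stated in the text. The function names are also our own.

```python
def dot(g, z):
    """Modulo-2 dot product of tap vector g (bit i of g is g_i) with state list z."""
    return sum(((g >> i) & 1) & z[i] for i in range(len(z))) % 2

def encode_nonrecursive(source, g_a, g_b, k):
    """Nonsystematic nonrecursive encoder: both outputs are tapped sums of the state."""
    z = [0] * (k + 1)            # z[0] is the incoming bit, z[1..k] the memory
    out = []
    for s in source:
        z[0] = s
        out += [dot(g_a, z), dot(g_b, z)]
        z = [0] + z[:-1]         # every bit moves one place along the delay line
    return out

def encode_recursive(source, g_feedback, g_b, k):
    """Systematic recursive encoder: t_a = s, and the taps g_feedback feed back into z[0]."""
    z = [0] * (k + 1)
    out = []
    for s in source:
        p = dot(g_feedback, z)   # z[0] is 0 here, so only the memory bits contribute
        z[0] = s ^ p
        out += [s, dot(g_b, z)]
        z = [0] + z[:-1]
    return out
```

With these conventions, `encode_nonrecursive([0,0,1,0,0,0,0,0], 0o7, 0o5, 2)` and `encode_recursive([0,0,1,1,1,0,0,0], 0o7, 0o5, 2)` produce the same transmission, beginning 0 0 0 0 1 1 1 0 1 1, while feeding the single-one string into the recursive encoder gives an output that keeps repeating with period three.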


> Exercise 48.1.!4] What is the input to the recursive filter such that its state 
sequence and the transmission are the same as those of the nonrecursive 
filter? (Hint: see figure 48.5.) 
[Trellis diagrams for figures 48.4 and 48.5; vertical axis: filter state
(00, 01, 10, 11); horizontal axis: time. The rows printed beneath each
trellis are:]

(a)           transmit  0 0 0 0 1 1 1 0 1 1 0 0 0
              source    0 0 1 0 0 0 0

(b)           transmit  0 0 0 0 1 1 0 1 0 1 0 0 0
              source    0 0 1 0 0 0 0

(figure 48.5) transmit  0 0 0 0 1 1 1 0 1 1 0 0 0
              source    0 0 1 1 1 0 0
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Figure 48.4. Trellises of the 
rate-1/2 convolutional codes of 
figure 48.3. It is assumed that the 
initial state of the filter is
$(z_2, z_1) = (0, 0)$. Time is on the
horizontal axis and the state of
the filter at each time step is the
vertical coordinate. On the line
segments are shown the emitted
symbols $t^{(a)}$ and $t^{(b)}$, with stars
for ‘1’ and boxes for ‘0’. The
paths taken through the trellises 
when the source sequence is 
00100000 are highlighted with a 
solid line. The light dotted lines 
show the state trajectories that 
are possible for other source 
sequences. 


Figure 48.5. The source sequence 
for the systematic recursive code 
00111000 produces the same path 
through the trellis as 00100000 
does in the nonsystematic 
nonrecursive case. 
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Figure 48.6. The trellis for a k = 4 
code painted with the likelihood 
function when the received vector 
is equal to a codeword with just 
one bit flipped. There are three 
line styles, depending on the value 
of the likelihood: thick solid lines 
show the edges in the trellis that 
match the corresponding two bits 
of the received string exactly; 
thick dotted lines show edges that 
match one bit but mismatch the 
other; and thin dotted lines show 
the edges that mismatch both 
bits. 

















received 0 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 


In general a linear-feedback shift-register with k bits of memory has an impulse 
response that is periodic with a period that is at most 2^k - 1, corresponding
to the filter visiting every non-zero state in its state space. 

Incidentally, cheap pseudorandom number generators and cheap crypto- 
graphic products make use of exactly these periodic sequences, though with 
larger values of k than 7; the random number seed or cryptographic key se- 
lects the initial state of the memory. There is thus a close connection between 
certain cryptanalysis problems and the decoding of convolutional codes. 
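
The bound on the period can be checked numerically. The following is a minimal
sketch, not the filter of figure 48.1: it assumes a generic k-bit shift
register whose feedback bit is the XOR of a chosen set of tapped positions. The
function name `lfsr_period` and the tap sets are illustrative assumptions.

```python
def lfsr_period(k, taps, start=1):
    """Number of clock cycles before a k-bit shift register revisits `start`.

    taps: 0-based bit positions XORed together to form the feedback bit.
    The top bit (k - 1) must be tapped so that the update is invertible.
    """
    assert (k - 1) in taps and 0 < start < (1 << k)
    mask = (1 << k) - 1
    state, steps = start, 0
    while True:
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        # Shift left, dropping the top bit and feeding the new bit in at the bottom.
        state = ((state << 1) | feedback) & mask
        steps += 1
        if state == start:
            return steps

for k, taps in [(3, (2, 1)), (4, (3, 2)), (4, (3, 1))]:
    print(k, taps, "period", lfsr_period(k, taps), "upper bound", 2**k - 1)
```

The first two tap sets visit every non-zero state, so they reach the bound
2^k - 1; the third has a shorter cycle, which shows that the bound is not
achieved by every choice of taps.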


»> 48.3 Decoding convolutional codes 


The receiver receives a bit stream, and wishes to infer the state sequence 
and thence the source stream. The posterior probability of each bit can be 
found by the sum-product algorithm (also known as the forward-backward or
BCJR algorithm), which was introduced in section 25.3. The most probable 
state sequence can be found using the min-sum algorithm of section 25.3 
(also known as the Viterbi algorithm). The nature of this task is illustrated 
in figure 48.6, which shows the cost associated with each edge in the trellis 
for the case of a sixteen-state code; the channel is assumed to be a binary 
symmetric channel and the received vector is equal to a codeword except that 
one bit has been flipped. There are three line styles, depending on the value 
of the likelihood: thick solid lines show the edges in the trellis that match the 
corresponding two bits of the received string exactly; thick dotted lines show 
edges that match one bit but mismatch the other; and thin dotted lines show 
the edges that mismatch both bits. The min-sum algorithm seeks the path 
through the trellis that uses as many solid lines as possible; more precisely, it 
minimizes the cost of the path, where the cost is zero for a solid line, one for 
a thick dotted line, and two for a thin dotted line. 
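
The cost-minimizing search can be sketched in a few lines. The example below
is an illustration under stated assumptions: it uses a four-state
nonsystematic nonrecursive rate-1/2 code (taps 111 and 101) rather than the
sixteen-state code of figure 48.6, and the names `step`, `encode` and
`min_sum_decode` are hypothetical. The edge cost is the number of received
bits that the edge mismatches, exactly as in the 0/1/2 costs described above.

```python
def step(state, bit):
    """One encoder step. state = (z1, z2) holds the two previous source bits."""
    z1, z2 = state
    ta = bit ^ z1 ^ z2          # taps 1 1 1
    tb = bit ^ z2               # taps 1 0 1
    return (bit, z1), (ta, tb)

def encode(source):
    state, out = (0, 0), []
    for b in source:
        state, pair = step(state, b)
        out.extend(pair)
    return out

def min_sum_decode(received):
    """Return (source bits, cost) of the cheapest path through the trellis."""
    # best[state] = (cost of cheapest path reaching state, source bits on it)
    best = {(0, 0): (0, [])}
    for i in range(0, len(received), 2):
        r = received[i:i + 2]
        new_best = {}
        for state, (cost, path) in best.items():
            for bit in (0, 1):
                nxt, (ta, tb) = step(state, bit)
                c = cost + (ta != r[0]) + (tb != r[1])   # edge cost 0, 1 or 2
                if nxt not in new_best or c < new_best[nxt][0]:
                    new_best[nxt] = (c, path + [bit])
        best = new_best
    cost, path = min(best.values(), key=lambda v: v[0])
    return path, cost

source = [0, 0, 1, 0, 0, 0, 0, 0]
received = encode(source)
received[5] ^= 1                 # flip one transmitted bit
print(min_sum_decode(received))
```

Storing the whole path with each state keeps the sketch short; a practical
decoder would store back-pointers instead.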


> Exercise 48.2.4 P-581] Can you spot the most probable path and the flipped 
bit? 
[Figure 48.7, trellis omitted. Upper path:]
transmit 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0
source   1 1 1 1 0 0 1 1

[Figure 48.7, lower path:]
transmit 1 1 1 0 1 0 1 0 0 0 0 1 1 1 0 1
source   1 1 1 1 0 0 1 0

[Figure 48.8, trellis omitted.]


Unequal protection 


A defect of the convolutional codes presented thus far is that they offer un- 
equal protection to the source bits. Figure 48.7 shows two paths through the 
trellis that differ in only two transmitted bits. The last source bit is less well 
protected than the other source bits. This unequal protection of bits motivates 
the termination of the trellis. 

A terminated trellis is shown in figure 48.8. Termination slightly reduces 
the number of source bits used per codeword. Here, four source bits are turned 
into parity bits because the k = 4 memory bits must be returned to zero. 


> 48.4 Turbo codes 


An (N, K) turbo code is defined by a number of constituent convolutional 
encoders (often, two) and an equal number of interleavers which are K x K 
permutation matrices. Without loss of generality, we take the first interleaver 
to be the identity matrix. A string of K source bits is encoded by feeding them 
into each constituent encoder in the order defined by the associated interleaver, 
and transmitting the bits that come out of each constituent encoder. Often 
the first constituent encoder is chosen to be a systematic encoder, just like the 
recursive filter shown in figure 48.6, and the second is a non-systematic one of 
rate 1 that emits parity bits only. The transmitted codeword then consists of
K source bits followed by M_1 parity bits generated by the first convolutional
code and M_2 parity bits from the second. The resulting turbo code has rate
1/3.

Figure 48.7. Two paths that differ in two transmitted bits only.

Figure 48.8. A terminated trellis. When any codeword is completed, the filter
state is 0000.

Figure 48.9. Rate-1/3 (a) and rate-1/2 (b) turbo codes represented as factor
graphs. The circles represent the codeword bits. The two rectangles represent
trellises of rate-1/2 convolutional codes, with the systematic bits occupying
the left half of the rectangle and the parity bits occupying the right half.
The puncturing of these constituent codes in the rate-1/2 turbo code is
represented by the lack of connections to half of the parity bits in each
trellis.

Figure 48.10. The encoder of a turbo code. Each box C_1, C_2, contains a
convolutional code. The source bits are reordered using a permutation π before
they are fed to C_2. The transmitted codeword is obtained by concatenating or
interleaving the outputs of the two convolutional codes.

The turbo code can be represented by a factor graph in which the two 
trellises are represented by two large rectangular nodes (figure 48.9a); the K 
source bits and the first M_1 parity bits participate in the first trellis and the K
source bits and the last M_2 parity bits participate in the second trellis. Each
codeword bit participates in either one or two trellises, depending on whether 
it is a parity bit or a source bit. Each trellis node contains a trellis exactly like 
the terminated trellis shown in figure 48.8, except one thousand times as long. 
[There are other factor graph representations for turbo codes that make use 
of more elementary nodes, but the factor graph given here yields the standard 
version of the sum-product algorithm used for turbo codes.] 

If a turbo code of higher rate such as 1/2 is required, a standard modification
to the rate-1/3 code is to puncture some of the parity bits (figure 48.9b).
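
The encoding rule and the effect of puncturing can be sketched as follows.
This is an illustration only: the constituent filter (feedback 1 + D + D^2,
feedforward 1 + D^2), the puncturing pattern, and the names `parity_stream`
and `turbo_encode` are assumptions, and trellis termination is ignored.

```python
import random

def parity_stream(bits):
    """Parity output of a recursive filter with two memory bits.

    Hypothetical constituent code: feedback taps 1 + D + D^2,
    feedforward taps 1 + D^2.
    """
    z1 = z2 = 0                      # a[n-1], a[n-2]
    out = []
    for b in bits:
        a = b ^ z1 ^ z2              # recursive (feedback) sum
        out.append(a ^ z2)           # feedforward parity bit
        z1, z2 = a, z1
    return out

def turbo_encode(source, perm, puncture=False):
    """K source bits, then parity from encoder 1, then parity from encoder 2."""
    p1 = parity_stream(source)                      # identity interleaver
    p2 = parity_stream([source[i] for i in perm])   # permuted order
    if puncture:                     # keep alternate parity bits from each stream
        p1, p2 = p1[0::2], p2[1::2]
    return list(source) + p1 + p2

rng = random.Random(0)
K = 8
source = [rng.randint(0, 1) for _ in range(K)]
perm = rng.sample(range(K), K)       # a fixed random permutation
print(len(turbo_encode(source, perm)))                 # 3K bits: rate 1/3
print(len(turbo_encode(source, perm, puncture=True)))  # 2K bits: rate 1/2
```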

Turbo codes are decoded using the sum-product algorithm described in 
Chapter 26. On the first iteration, each trellis receives the channel likelihoods, 
and runs the forward-backward algorithm to compute, for each bit, the relative 
likelihood of its being 1 or 0, given the information about the other bits. 
These likelihoods are then passed across from each trellis to the other, and 
multiplied by the channel likelihoods on the way. We are then ready for the 
second iteration: the forward-backward algorithm is run again in each trellis 
using the updated probabilities. After about ten or twenty such iterations, it’s 
hoped that the correct decoding will be found. It is common practice to stop 
after some fixed number of iterations, but we can do better. 

As a stopping criterion, the following procedure can be used at every iter- 
ation. For each time-step in each trellis, we identify the most probable edge, 
according to the local messages. If these most probable edges join up into two 
valid paths, one in each trellis, and if these two paths are consistent with each 
other, it is reasonable to stop, as subsequent iterations are unlikely to take 
the decoder away from this codeword. If a maximum number of iterations is 
reached without this stopping criterion being satisfied, a decoding error can 
be reported. This stopping procedure is recommended for several reasons: it 
allows a big saving in decoding time with no loss in error probability; it allows 
decoding failures that are detected by the decoder to be so identified — knowing 
that a particular block is definitely corrupted is surely useful information for 
the receiver! And when we distinguish between detected and undetected errors,
the undetected errors give helpful insights into the low-weight codewords
of the code, which may improve the process of code design.

Turbo codes as described here have excellent performance down to decoded
error probabilities of about 10^-5, but randomly-constructed turbo codes tend
to have an error floor starting at that level. This error floor is caused by
low-weight codewords. To reduce the height of the error floor, one can attempt
to modify the random construction to increase the weight of these low-weight
codewords. The tweaking of turbo codes is a black art, and it never succeeds
in totally eliminating low-weight codewords; more precisely, the low-weight
codewords can be eliminated only by sacrificing the turbo code’s excellent
performance. In contrast, low-density parity-check codes rarely have error
floors, as long as their number of weight-2 columns is not too large (cf.
exercise 47.3, p.572).


48.5 Parity-check matrices of convolutional codes and turbo codes 


We close by discussing the parity-check matrix of a rate-1/2 convolutional code 
viewed as a linear block code. We adopt the convention that the N bits of one 
block are made up of the N/2 bits t^(a) followed by the N/2 bits t^(b).


Exercise 48.3.!7] Prove that a convolutional code has a low-density parity- 
check matrix as shown schematically in figure 48.11a. 


Hint: It’s easiest to figure out the parity constraints satisfied by a convo- 
lutional code by thinking about the nonsystematic nonrecursive encoder 
(figure 48.1b). Consider putting through filter a a stream that’s been 
through convolutional filter b, and vice versa; compare the two resulting 
streams. Ignore termination of the trellises. 
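
The comparison suggested in the hint can be tried numerically before
attempting the proof. The sketch below is a check on one random stream, not a
proof; the tap vectors and the function name `gf2_convolve` are illustrative
assumptions, and the filters start in the all-zero state.

```python
import random

def gf2_convolve(taps, stream):
    """Pass a bit stream through a nonrecursive filter (arithmetic mod 2).

    taps[j] multiplies the input delayed by j clock cycles.
    """
    out = []
    for n in range(len(stream)):
        bit = 0
        for j, g in enumerate(taps):
            if g and n - j >= 0:
                bit ^= stream[n - j]
        out.append(bit)
    return out

taps_a = [1, 1, 1]                   # hypothetical filter a
taps_b = [1, 0, 1]                   # hypothetical filter b
rng = random.Random(0)
s = [rng.randint(0, 1) for _ in range(20)]

t_a = gf2_convolve(taps_a, s)        # stream emitted by filter a
t_b = gf2_convolve(taps_b, s)        # stream emitted by filter b

# Put stream b through filter a, and stream a through filter b.
print(gf2_convolve(taps_a, t_b) == gf2_convolve(taps_b, t_a))
```

Each position at which the two doubly-filtered streams agree is one linear
constraint on the transmitted bits, and each constraint involves only as many
bits as there are non-zero taps.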


The parity-check matrix of a turbo code can be written down by listing the 
constraints satisfied by the two constituent trellises (figure 48.11b). So turbo 
codes are also special cases of low-density parity-check codes. If a turbo code 
is punctured, it no longer necessarily has a low-density parity-check matrix, 
but it always has a generalized parity-check matrix that is sparse, as explained 
in the next chapter. 


Further reading 


For further reading about convolutional codes, Johannesson and Zigangirov 
(1999) is highly recommended. One topic I would have liked to include is 
sequential decoding. Sequential decoding explores only the most promising 
paths in the trellis, and backtracks when evidence accumulates that a wrong 
turning has been taken. Sequential decoding is used when the trellis is too 
big for us to be able to apply the maximum likelihood algorithm, the min- 
sum algorithm. You can read about sequential decoding in Johannesson and 
Zigangirov (1999). 

For further information about the use of the sum-product algorithm in
turbo codes, and the rarely-used but highly recommended stopping criteria 
for halting their decoding, Frey (1998) is essential reading. (And there’s lots 
more good stuff in the same book!) 


48.6 Solutions 


Solution to exercise 48.2 (p.578). The first bit was flipped. The most probable 
path is the upper one in figure 48.7. 
Figure 48.11. Schematic pictures of the parity-check matrices of (a) a
convolutional code, rate 1/2, and (b) a turbo code, rate 1/3. Notation: A
diagonal line represents an identity matrix. A band of diagonal lines
represents a band of diagonal 1s. A circle inside a square represents the
random permutation of all the columns in that square. A number inside a square
represents the number of random permutation matrices superposed in that
square. Horizontal and vertical lines indicate the boundaries of the blocks
within the matrix.


49


Repeat-Accumulate Codes


In Chapter 1 we discussed a very simple and not very effective method for 
communicating over a noisy channel: the repetition code. We now discuss a 
code that is almost as simple, and whose performance is outstandingly good. 

Repeat-accumulate codes were studied by Divsalar et al. (1998) for theo- 
retical purposes, as simple turbo-like codes that might be more amenable to 
analysis than messy turbo codes. Their practical performance turned out to 
be just as good as other sparse-graph codes. 


»> 49.1 The encoder 


1. Take K source bits.

       s_1 s_2 s_3 ... s_K

2. Repeat each bit three times, giving N = 3K bits.

       s_1 s_1 s_1 s_2 s_2 s_2 s_3 s_3 s_3 ... s_K s_K s_K

3. Permute these N bits using a random permutation (a fixed random
   permutation, the same one for every codeword). Call the permuted
   string u.

       u_1 u_2 u_3 u_4 u_5 u_6 u_7 u_8 u_9 ... u_N

4. Transmit the accumulated sum.

       t_1 = u_1
       t_2 = t_1 + u_2 (mod 2)
       ...
       t_n = t_{n-1} + u_n (mod 2)
       ...
       t_N = t_{N-1} + u_N (mod 2).                    (49.1)

5. That’s it!
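
The encoder steps translate almost line for line into code. The sketch below
assumes bits are stored as Python integers 0 and 1; the function name
`ra_encode` and the use of `random.Random` to pick the fixed permutation are
illustrative choices.

```python
import random

def ra_encode(source, perm, repeats=3):
    """Repeat each source bit, permute, then accumulate modulo 2."""
    repeated = [b for b in source for _ in range(repeats)]   # step 2
    u = [repeated[i] for i in perm]                          # step 3
    t, acc = [], 0
    for bit in u:                                            # step 4
        acc ^= bit               # t_n = t_{n-1} + u_n (mod 2)
        t.append(acc)
    return t

K = 4
source = [1, 0, 1, 1]
perm = random.Random(0).sample(range(3 * K), 3 * K)   # fixed for every codeword
t = ra_encode(source, perm)
print(len(t))                    # N = 3K transmitted bits
```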





»> 49.2 Graph 


Figure 49.1a shows the graph of a repeat-accumulate code, using four types
of node: equality constraints [=], intermediate binary variables (black circles),
parity constraints [+], and the transmitted bits (white circles).

The source sets the values of the black bits at the bottom, three at a time, 
and the accumulator computes the transmitted bits along the top. 

Figure 49.1. Factor graphs for a repeat-accumulate code with rate 1/3.
(a) Using elementary nodes. Each white circle represents a transmitted bit.
Each [+] constraint forces the sum of the 3 bits to which it is connected to be
even. Each black circle represents an intermediate binary variable. Each [=]
constraint forces the three variables to which it is connected to be equal.
(b) Factor graph normally used for decoding.

Figure 49.2. Performance of six rate-1/3 repeat-accumulate codes on the
Gaussian channel. The blocklengths range from N = 204 to N = 30000. Vertical
axis: block error probability; horizontal axis: E_b/N_0. The dotted lines show
the frequency of undetected errors.



This graph is a factor graph for the prior probability over codewords, 
with the circles being binary variable nodes, and the squares representing 
two types of factor nodes. As usual, each [+] contributes a factor of the form
1[Σ x = 0 mod 2]; each [=] contributes a factor of the form 1[x_1 = x_2 = x_3].


























> 49.3 Decoding 


The repeat—accumulate code is normally decoded using the sum-product algo- 
rithm on the factor graph depicted in figure 49.1b. The top box represents the 
trellis of the accumulator, including the channel likelihoods. In the first half 
of each iteration, the top trellis receives likelihoods for every transition in the 
trellis, and runs the forward-backward algorithm so as to produce likelihoods
for each variable node. In the second half of the iteration, these likelihoods
are multiplied together at the [=] nodes to produce new likelihood messages to
send back to the trellis. 
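
The multiplication at an equality node can be sketched as follows. The
representation of a message as a pair (likelihood of 0, likelihood of 1) and
the function name `equality_node_messages` are assumptions made for
illustration; the normalization is a convenience and assumes the incoming
messages do not contradict each other outright.

```python
def equality_node_messages(incoming):
    """Outgoing messages from an [=] node.

    incoming: one (p0, p1) pair per edge. The message sent back along an
    edge is the normalized product of the messages on all the other edges.
    """
    outgoing = []
    for i in range(len(incoming)):
        p0 = p1 = 1.0
        for j, (q0, q1) in enumerate(incoming):
            if j != i:
                p0 *= q0
                p1 *= q1
        z = p0 + p1
        outgoing.append((p0 / z, p1 / z))
    return outgoing

# Three copies of one source bit: two fairly confident votes for 0, one undecided.
for message in equality_node_messages([(0.9, 0.1), (0.8, 0.2), (0.5, 0.5)]):
    print(message)
```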


As with Gallager codes and turbo codes, the stop-when-it’s-done decoding 
method can be applied, so it is possible to distinguish between undetected 
errors (which are caused by low-weight codewords in the code) and detected 
errors (where the decoder gets stuck and knows that it has failed to find a 
valid answer). 














Figure 49.2 shows the performance of six randomly-constructed repeat- 
accumulate codes on the Gaussian channel. If one does not mind the error 
floor which kicks in at about a block error probability of 10^-4, the performance
is staggeringly good for such a simple code (cf. figure 47.17). 


> 49.4 Empirical distribution of decoding times


It is interesting to study the number of iterations τ of the sum-product
algorithm required to decode a sparse-graph code. Given one code and a set of
channel conditions, the decoding time varies randomly from trial to trial. We
find that the histogram of decoding times follows a power law, P(τ) ∝ τ^(-p),
for large τ. The power p depends on the signal-to-noise ratio and becomes
smaller (so that the distribution is more heavy-tailed) as the signal-to-noise
ratio decreases. We have observed power laws in repeat-accumulate codes
and in irregular and regular Gallager codes. Figures 49.3(ii) and (iii) show the
distribution of decoding times of a repeat-accumulate code at two different
signal-to-noise ratios. The power laws extend over several orders of magnitude.
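
One simple way to estimate the power p from a histogram is a straight-line fit
on log-log axes. The sketch below is an assumption about how such a fit might
be done, not the procedure used to produce figure 49.3, and the counts it is
run on are synthetic values generated from an exact power law.

```python
import math

def fit_power_law(taus, freqs):
    """Least-squares slope of log(freq) against log(tau); returns p."""
    xs = [math.log(t) for t in taus]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope                    # P(tau) ∝ tau^(-p) has slope -p

# Synthetic histogram tail with a known exponent (illustrative values only).
true_p = 6.0
taus = list(range(20, 200, 10))
freqs = [1e12 * t ** (-true_p) for t in taus]
print(fit_power_law(taus, freqs))
```

With real histograms the fit should be restricted to large τ, and bins with
zero counts must be excluded before taking logarithms.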


Exercise 49.1.1] Investigate these power laws. Does density evolution predict 
them? Can the design of a code be used to manipulate the power law in 
a useful way? 


49.5 Generalized parity-check matrices 


I find that it is helpful when relating sparse-graph codes to each other to use
a common representation for them all. Forney (2001) introduced the idea of
a normal graph in which the only nodes are [+] and [=] and all variable nodes
have degree one or two; variable nodes with degree two can be represented on
edges that connect a [+] node to a [=] node. The generalized parity-check matrix
is a graphical way of representing normal graphs. In a parity-check matrix,
the columns are transmitted bits, and the rows are linear constraints. In a
generalized parity-check matrix, additional columns may be included, which
represent state variables that are not transmitted. One way of thinking of these
state variables is that they are punctured from the code before transmission.

Figure 49.3. Histograms of number of iterations to find a valid decoding for a
repeat-accumulate code with source block length K = 10000 and transmitted
blocklength N = 30000. (a) Block error probability versus signal-to-noise
ratio for the RA code. (ii.b) Histogram for x/σ = 0.89, E_b/N_0 = 0.749 dB.
(ii.c) x/σ = 0.90, E_b/N_0 = 0.846 dB. (iii.b, iii.c) Fits of power laws to
(ii.b) (1/τ^6) and (ii.c) (1/τ^9).

Figure 49.4. The generator matrix, parity-check matrix, and a generalized
parity-check matrix of a repetition code with rate 1/3.

State variables are indicated by a horizontal line above the corresponding
columns. The other pieces of diagrammatic notation for generalized parity-check
matrices are, as in (MacKay, 1999b; MacKay et al., 1998):


• A diagonal line in a square indicates that that part of the matrix contains
  an identity matrix.

• Two or more parallel diagonal lines indicate a band-diagonal matrix with
  a corresponding number of 1s per row.

• A horizontal ellipse with an arrow on it indicates that the corresponding
  columns in a block are randomly permuted.

• A vertical ellipse with an arrow on it indicates that the corresponding
  rows in a block are randomly permuted.

• An integer surrounded by a circle represents that number of superposed
  random permutation matrices.


Definition. A generalized parity-check matrix is a pair {A,p}, where A is a 
binary matrix and p is a list of the punctured bits. The matrix defines a set 
of valid vectors x, satisfying 


Ax = 0; (49.2) 


for each valid vector there is a codeword t(x) that is obtained by puncturing 
from x the bits indicated by p. For any one code there are many generalized 
parity-check matrices. 


The rate of a code with generalized parity-check matrix {A,p} can be
estimated as follows. If A has M’ rows and L columns, and p punctures S bits
and selects N bits for transmission (L = N + S), then the effective number of
constraints on the codeword, M, is

    M = M’ - S,                                        (49.3)

the number of source bits is

    K = N - M = L - M’,                                (49.4)

and the rate is greater than or equal to

    R = 1 - M/N = 1 - (M’ - S)/(L - S).                (49.5)
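
For a very small code the definition and the rate estimate can be checked by
brute force. The sketch below enumerates all vectors x with Ax = 0 (mod 2),
punctures them, and compares the number of distinct codewords with the bound.
The matrix used is a hypothetical example of the form [G^T | I_N] with N = 4
and K = 2, and the function names are illustrative.

```python
from itertools import product
from math import log2

def codewords(A, punctured):
    """All codewords t(x) obtained by puncturing the valid vectors x."""
    L = len(A[0])
    keep = [i for i in range(L) if i not in punctured]
    words = set()
    for x in product((0, 1), repeat=L):
        if all(sum(a * b for a, b in zip(row, x)) % 2 == 0 for row in A):
            words.add(tuple(x[i] for i in keep))
    return words

def rate_lower_bound(m_prime, L, S):
    N = L - S                    # transmitted bits
    M = m_prime - S              # effective number of constraints
    return 1 - M / N

# Columns 0-1 are punctured state bits; columns 2-5 are transmitted bits.
A = [[1, 0, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [0, 1, 0, 0, 1, 0],
     [1, 1, 0, 0, 0, 1]]
words = codewords(A, punctured={0, 1})
print(sorted(words))
print("actual rate", log2(len(words)) / 4,
      "bound", rate_lower_bound(m_prime=4, L=6, S=2))
```

The enumeration costs 2^L evaluations, so it is only a check for toy sizes.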

Figure 49.5. The generator matrix and parity-check matrix of a systematic
low-density generator-matrix code. The code has rate 1/3.

Figure 49.6. The generator matrix and generalized parity-check matrix of a
non-systematic low-density generator-matrix code. The code has rate 1/2.


Examples 


Repetition code. The generator matrix, parity-check matrix, and generalized
parity-check matrix of a simple rate-1/3 repetition code are shown in figure 49.4.


Systematic low-density generator-matrix code. In an (N, K) systematic
low-density generator-matrix code, there are no state variables. A transmitted
codeword t of length N is given by

    t = G^T s,                                         (49.6)

where

    G^T = [ I_K ]
          [  P  ],                                     (49.7)

with I_K denoting the K x K identity matrix, and P being a very sparse M x K
matrix, where M = N - K. The parity-check matrix of this code is

    H = [P | I_M].                                     (49.8)

In the case of a rate-1/3 code, this parity-check matrix might be represented
as shown in figure 49.5.
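
That H = [P | I_M] annihilates every codeword t = G^T s can be confirmed on a
small random instance. In the sketch below the way P is made sparse (a fixed
number of 1s per row) is an assumption chosen for illustration, as are the
function names.

```python
import random
from itertools import product

def systematic_ldgm(K, M, weight, rng):
    """Return (G^T, H) with G^T = [I_K; P] and H = [P | I_M]."""
    P = [[0] * K for _ in range(M)]
    for row in P:
        for j in rng.sample(range(K), weight):   # `weight` ones per row of P
            row[j] = 1
    identity_K = [[int(i == j) for j in range(K)] for i in range(K)]
    GT = identity_K + [row[:] for row in P]
    H = [P[i] + [int(i == j) for j in range(M)] for i in range(M)]
    return GT, H

def matvec_mod2(A, x):
    return [sum(a * b for a, b in zip(row, x)) % 2 for row in A]

K, M = 4, 8                                      # N = 12, rate 1/3
GT, H = systematic_ldgm(K, M, weight=2, rng=random.Random(0))
ok = all(matvec_mod2(H, matvec_mod2(GT, list(s))) == [0] * M
         for s in product((0, 1), repeat=K))
print(ok)
```

The check works because H G^T = P + P = 0 modulo 2, whatever P is.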


Non-systematic low-density generator-matrix code. In an (N, K) non-systematic
low-density generator-matrix code, a transmitted codeword t of length N is
given by

    t = G^T s,                                         (49.9)

where G^T is a very sparse N x K matrix. The generalized parity-check matrix
of this code is

    A = [G^T | I_N],                                   (49.10)

and the corresponding generalized parity-check equation is

    A x = 0,  where  x = [ s ]
                         [ t ].                        (49.11)

Whereas the parity-check matrix of this simple code is typically a complex,
dense matrix, the generalized parity-check matrix retains the underlying
simplicity of the code.

In the case of a rate-1/2 code, this generalized parity-check matrix might
be represented as shown in figure 49.6.


Low-density parity-check codes and linear MN codes. The parity-check matrix
of a rate-1/3 low-density parity-check code is shown in figure 49.7a.

A linear MN code is a non-systematic low-density parity-check code. The
K state bits of an MN code are the source bits. Figure 49.7b shows the
generalized parity-check matrix of a rate-1/2 linear MN code.

Figure 49.7. The generalized parity-check matrices of (a) a rate-1/3 Gallager
code with M/2 columns of weight 2; (b) a rate-1/2 linear MN code.

Figure 49.8. The generalized parity-check matrices of (a) a convolutional code
with rate 1/2. (b) a rate-1/3 turbo code built by parallel concatenation of
two convolutional codes.


Convolutional codes. In a non-systematic, non-recursive convolutional code, 
the source bits, which play the role of state bits, are fed into a delay-line and 
two linear functions of the delay-line are transmitted. In figure 49.8a, these 
two parity streams are shown as two successive vectors of length K. [It is 
common to interleave these two parity streams, a bit-reordering that is not 
relevant here, and is not illustrated.] 


Concatenation. ‘Parallel concatenation’ of two codes is represented in one of 
these diagrams by aligning the matrices of two codes in such a way that the 
‘source bits’ line up, and by adding blocks of zero-entries to the matrix such 
that the state bits and parity bits of the two codes occupy separate columns. 
An example is given by the turbo code that follows. In ‘serial concatenation’, 
the columns corresponding to the transmitted bits of the first code are aligned 
with the columns corresponding to the source bits of the second code. 








Turbo codes. A turbo code is the parallel concatenation of two convolutional
codes. The generalized parity-check matrix of a rate-1/3 turbo code is shown 
in figure 49.8b. 














Repeat-accumulate codes. The generalized parity-check matrix of a rate-1/3
repeat-accumulate code is shown in figure 49.9. Repeat-accumulate codes are
equivalent to staircase codes (section 47.7, p.569).


Intersection. The generalized parity-check matrix of the intersection of two
codes is made by stacking their generalized parity-check matrices on top of
each other in such a way that all the transmitted bits’ columns are correctly
aligned, and any punctured bits associated with the two component codes
occupy separate columns.

Figure 49.9. The generalized parity-check matrix of a repeat-accumulate code
with rate 1/3.


About Chapter 50


The following exercise provides a helpful background for digital fountain codes. 


> Exercise 50.1.9] An author proofreads his K = 700-page book by inspecting 
random pages. He makes N page-inspections, and does not take any 
precautions to avoid inspecting the same page twice. 


(a) After N = K page-inspections, what fraction of pages do you expect 
have never been inspected? 


(b) After N > K page-inspections, what is the probability that one or 
more pages have never been inspected? 


(c) Show that in order for the probability that all K pages have been
inspected to be 1 - δ, we require N ≃ K ln(K/δ) page-inspections.


[This problem is commonly presented in terms of throwing N balls at
random into K bins; what’s the probability that every bin gets at least
one ball?]
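
For readers who want numbers to compare their answers with, the quantities in
the exercise can be evaluated directly. The sketch below assumes each
inspection picks one of the K pages uniformly and independently, and uses the
union bound K(1 - 1/K)^N on the probability that some page is missed; the
function names are illustrative.

```python
import math

def expected_fraction_missed(K, N):
    """Probability that a given page escapes all N inspections."""
    return (1 - 1 / K) ** N

def union_bound_any_missed(K, N):
    """Upper bound on the probability that one or more pages are missed."""
    return K * expected_fraction_missed(K, N)

def inspections_needed(K, delta):
    """The estimate N = K ln(K / delta) from part (c)."""
    return K * math.log(K / delta)

K = 700
print(expected_fraction_missed(K, K), math.exp(-1))   # part (a): close to 1/e
N = inspections_needed(K, delta=0.01)
print(N, union_bound_any_missed(K, N))                # bound is at most delta
```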




Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 
You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


50 


Digital Fountain Codes 


Digital fountain codes are record-breaking sparse-graph codes for channels 
with erasures. 

Channels with erasures are of great importance. For example, files sent 
over the internet are chopped into packets, and each packet is either received 
without error or not received. A simple channel model describing this situation 
is a q-ary erasure channel, which has (for all inputs in the input alphabet
{0,1,2,...,q-1}) a probability 1 - f of transmitting the input without error,
and probability f of delivering the output ‘?’. The alphabet size q is 2^l, where
l is the number of bits in a packet.

Common methods for communicating over such channels employ a feed- 
back channel from receiver to sender that is used to control the retransmission 
of erased packets. For example, the receiver might send back messages that 
identify the missing packets, which are then retransmitted. Alternatively, the 
receiver might send back messages that acknowledge each received packet; the 
sender keeps track of which packets have been acknowledged and retransmits 
the others until all packets have been acknowledged. 

These simple retransmission protocols have the advantage that they will 
work regardless of the erasure probability f, but purists who have learned their 
Shannon theory will feel that these retransmission protocols are wasteful. If 
the erasure probability f is large, the number of feedback messages sent by 
the first protocol will be large. Under the second protocol, it’s likely that the 
receiver will end up receiving multiple redundant copies of some packets, and 
heavy use is made of the feedback channel. According to Shannon, there is no 
need for the feedback channel: the capacity of the forward channel is (1 - f)l
bits, whether or not we have feedback. 

The wastefulness of the simple retransmission protocols is especially evi- 
dent in the case of a broadcast channel with erasures — channels where one 
sender broadcasts to many receivers, and each receiver receives a random 
fraction (1 - f) of the packets. If every packet that is missed by one or more
receivers has to be retransmitted, those retransmissions will be terribly re- 
dundant. Every receiver will have already received most of the retransmitted 
packets. 

So, we would like to make erasure-correcting codes that require no feed- 
back or almost no feedback. The classic block codes for erasure correction are 
called Reed-Solomon codes. An (N, K) Reed-Solomon code (over an alphabet
of size q = 2^l) has the ideal property that if any K of the N transmitted
symbols are received then the original K source symbols can be recovered.
[See Berlekamp (1968) or Lin and Costello (1983) for further information;
Reed-Solomon codes exist for N < q.] But Reed-Solomon codes have the
disadvantage that they are practical only for small K, N, and q: standard
implementations of encoding and decoding have a cost of order K(N-K) log_2 N
packet operations. Furthermore, with a Reed-Solomon code, as with any block 
code, one must estimate the erasure probability f and choose the code rate 
R= K/N before transmission. If we are unlucky and f is larger than expected 
and the receiver receives fewer than K symbols, what are we to do? We’d like 
a simple way to extend the code on the fly to create a lower-rate (N’, K) code. 
For Reed-Solomon codes, no such on-the-fly method exists. 

There is a better way, pioneered by Michael Luby (2002) at his company 
Digital Fountain, the first company whose business is based on sparse-graph 
codes. 

The digital fountain codes I describe here, LT codes, were invented by
Luby in 1998. [LT stands for ‘Luby transform’.] The idea of a digital fountain
code is as follows. The encoder is
a fountain that produces an endless supply of water drops (encoded packets);
let’s say the original source file has a size of Kl bits, and each drop contains 
l encoded bits. Now, anyone who wishes to receive the encoded file holds a 
bucket under the fountain and collects drops until the number of drops in the 
bucket is a little larger than K. They can then recover the original file. 

Digital fountain codes are rateless in the sense that the number of encoded 
packets that can be generated from the source message is potentially limitless; 
and the number of encoded packets generated can be determined on the fly. 
Regardless of the statistics of the erasure events on the channel, we can send 
as many encoded packets as are needed in order for the decoder to recover 
the source data. The source data can be decoded from any set of K’ encoded 
packets, for K’ slightly larger than K (in practice, about 5% larger). 

Digital fountain codes also have fantastically small encoding and decoding
complexities. With probability 1 - δ, K packets can be communicated
with average encoding and decoding costs both of order K ln(K/δ) packet
operations.

Luby calls these codes universal because they are simultaneously near- 
optimal for every erasure channel, and they are very efficient as the file length 
K grows. The overhead K’ - K is of order √K (ln(K/δ))^2.


> 50.1 A digital fountain’s encoder 


Each encoded packet t_n is produced from the source file s_1 s_2 s_3 ... s_K as
follows:

1. Randomly choose the degree d_n of the packet from a degree distribution
   p(d); the appropriate choice of p depends on the source file size K, as
   we’ll discuss later.

2. Choose, uniformly at random, d_n distinct input packets, and set t_n
   equal to the bitwise sum, modulo 2, of those d_n packets. This sum
   can be done by successively exclusive-or-ing the packets together.
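
The two encoding steps can be sketched directly. In the example below packets
are stored as Python integers, so that the bitwise sum modulo 2 is the `^`
operator; the degree distribution shown is a placeholder chosen only to make
the example run, and the function name `encode_packet` is an assumption.

```python
import random

def encode_packet(source, degree_dist, rng):
    """Produce one encoded packet and the list of source packets it covers.

    source:      list of K packets, each an int holding l bits
    degree_dist: list of (degree, probability) pairs
    """
    degrees, probs = zip(*degree_dist)
    d = rng.choices(degrees, weights=probs)[0]          # step 1: pick the degree
    neighbours = rng.sample(range(len(source)), d)      # step 2: d distinct packets
    t = 0
    for k in neighbours:
        t ^= source[k]                                  # successive exclusive-or
    return t, sorted(neighbours)

source = [0b1011, 0b0110, 0b1100]                       # K = 3 packets of l = 4 bits
placeholder_dist = [(1, 0.3), (2, 0.5), (3, 0.2)]       # not a recommended choice
rng = random.Random(0)
for _ in range(4):
    print(encode_packet(source, placeholder_dist, rng))
```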





This encoding operation defines a graph connecting encoded packets to 
source packets. If the mean degree d is significantly smaller than K then the 
graph is sparse. We can think of the resulting code as an irregular low-density 
generator-matrix code. 

The decoder needs to know the degree of each packet that is received, and
which source packets it is connected to in the graph. This information can
be communicated to the decoder in various ways. For example, if the sender
and receiver have synchronized clocks, they could use identical pseudo-random
number generators, seeded by the clock, to choose each random degree and
each set of connections. Alternatively, the sender could pick a random key,
$\kappa_n$, given which the degree and the connections are determined by a
pseudo-random process, and send that key in the header of the packet. As long
as the packet size l is much bigger than the key size (which need only be
32 bits or so), this key introduces only a small overhead cost.
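
The encoder and the key-in-header idea can be sketched together as follows. This is only an illustrative sketch: the function names `packet_graph` and `encode_packet`, the use of Python's `random.Random` as the shared pseudo-random process, and the representation of packets as `bytes` are all assumptions, not part of any published LT-code implementation.

```python
import random

def packet_graph(key, K, degree_weights):
    """Regenerate a packet's connections from its key (hypothetical helper).

    degree_weights[d - 1] is proportional to the probability of degree d,
    for d = 1..K. Sender and receiver must call this with identical inputs.
    """
    rng = random.Random(key)                      # pseudo-random process driven by the key
    d = rng.choices(range(1, K + 1), weights=degree_weights)[0]
    return sorted(rng.sample(range(K), d))        # d distinct source-packet indices

def encode_packet(source, key, degree_weights):
    """Return (key, payload), where payload is the XOR of the chosen source packets."""
    neighbours = packet_graph(key, len(source), degree_weights)
    payload = bytes(len(source[0]))               # all-zero packet of the right length
    for k in neighbours:
        payload = bytes(a ^ b for a, b in zip(payload, source[k]))
    return key, payload

# Toy usage: three two-byte source packets and a uniform degree distribution.
source = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
weights = [1.0, 1.0, 1.0]
key, payload = encode_packet(source, key=1234, degree_weights=weights)
```

Because the receiver can call `packet_graph` with the key found in the header, only the key (not the list of connections) needs to travel with each packet.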
> 50.2 The decoder














Decoding a sparse-graph code is especially easy in the case of an erasure
channel. The decoder’s task is to recover s from t = Gs, where G is the matrix
associated with the graph. The simple way to attempt to solve this problem
is by message-passing. We can think of the decoding algorithm as the
sum-product algorithm if we wish, but all messages are either completely
uncertain messages or completely certain messages. Uncertain messages assert
that a message packet $s_k$ could have any value, with equal probability; certain
messages assert that $s_k$ has a particular value, with probability one.

This simplicity of the messages allows a simple description of the decoding 
process. We’ll call the encoded packets $\{t_n\}$ check nodes.





























1. Find a check node $t_n$ that is connected to only one source packet
   $s_k$. (If there is no such check node, this decoding algorithm halts at
   this point, and fails to recover all the source packets.)

   (a) Set $s_k = t_n$.

   (b) Add $s_k$ to all checks $t_{n'}$ that are connected to $s_k$:

       $$ t_{n'} := t_{n'} + s_k \quad \text{for all } n' \text{ such that } G_{n'k} = 1. \qquad (50.1) $$

   (c) Remove all the edges connected to the source packet $s_k$.

2. Repeat (1) until all $\{s_k\}$ are determined.

Figure 50.1. Example decoding for a digital fountain code with K = 3 source
bits and N = 4 encoded bits.





This decoding process is illustrated in figure 50.1 for a toy case where each
packet is just one bit. There are three source packets (shown by the upper
circles) and four received packets (shown by the lower check symbols), which
have the values $t_1 t_2 t_3 t_4 = 1011$ at the start of the algorithm.

At the first iteration, the only check node that is connected to a sole source
bit is the first check node (panel a). We set that source bit $s_1$ accordingly
(panel b), discard the check node, then add the value of $s_1$ (1) to the checks to
which it is connected (panel c), disconnecting $s_1$ from the graph. At the start
of the second iteration (panel c), the fourth check node is connected to a sole
source bit, $s_2$. We set $s_2$ to $t_4$ (0, in panel d), and add $s_2$ to the two checks
it is connected to (panel e). Finally, we find that two check nodes are both
connected to $s_3$, and they agree about the value of $s_3$ (as we would hope!),
which is restored in panel f.
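
The decoding procedure can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function name `peel_decode` is hypothetical, packets are represented as integers so that addition modulo 2 is the `^` operator, and the toy graph below is an assumed reconstruction that is consistent with the narrative above (received values 1011, check 1 touching only the first source bit, check 4 touching the first two), not a transcription of the figure.

```python
def peel_decode(K, checks):
    """Message-passing decoder for an erasure channel (illustrative sketch).

    checks is a list of (value, set_of_source_indices) pairs, one per received
    packet. Returns a list of K values; entries stay None if decoding halts.
    """
    values = [value for value, _ in checks]
    edges = [set(neighbours) for _, neighbours in checks]
    source = [None] * K
    while True:
        # Step 1: find a check node connected to exactly one source packet.
        n = next((i for i, e in enumerate(edges) if len(e) == 1), None)
        if n is None:
            return source                    # finished, or stuck with no degree-one check
        k = next(iter(edges[n]))
        source[k] = values[n]                # (a) set s_k = t_n
        for m, e in enumerate(edges):
            if k in e:
                values[m] ^= source[k]       # (b) add s_k to every connected check
                e.discard(k)                 # (c) remove the edge to s_k

# Assumed toy graph (0-based indices): t1 = s1, t2 = s1+s2+s3, t3 = s2+s3, t4 = s1+s2.
toy_checks = [(1, {0}), (0, {0, 1, 2}), (1, {1, 2}), (1, {0, 1})]
decoded = peel_decode(3, toy_checks)
```

If no check of degree one exists at some stage, the returned list still contains `None` entries, which corresponds to the decoder halting without recovering every source packet.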


> 50.3 Designing the degree distribution 


The probability distribution $\rho(d)$ of the degree is a critical part of the design:
occasional encoded packets must have high degree (i.e., d similar to K) in
order to ensure that there are not some source packets that are connected to
no-one. Many packets must have low degree, so that the decoding process
can get started, and keep going, and so that the total number of addition
operations involved in the encoding and decoding is kept small. For a given
degree distribution $\rho(d)$, the statistics of the decoding process can be predicted
by an appropriate version of density evolution.

Ideally, to avoid redundancy, we’d like the received graph to have the
property that just one check node has degree one at each iteration. At each
iteration, when this check node is processed, the degrees in the graph are reduced
in such a way that one new degree-one check node appears. In expectation,
this ideal behaviour is achieved by the ideal soliton distribution,

$$ \rho(1) = 1/K, \qquad \rho(d) = \frac{1}{d(d-1)} \quad \text{for } d = 2, 3, \ldots, K. \qquad (50.2) $$

The expected degree under this distribution is roughly $\ln K$.
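
As a quick numerical check of these two statements (normalization and mean degree), one can tabulate the distribution directly. The helper name `ideal_soliton` and the choice K = 10000 are assumptions made for illustration.

```python
import math

def ideal_soliton(K):
    """Return [rho(1), ..., rho(K)] for the ideal soliton distribution (50.2)."""
    return [1.0 / K] + [1.0 / (d * (d - 1)) for d in range(2, K + 1)]

K = 10000
rho = ideal_soliton(K)
total = sum(rho)                                              # should be 1
mean_degree = sum(d * p for d, p in enumerate(rho, start=1))  # harmonic number H_K
print(total, mean_degree, math.log(K))
```

The mean degree is exactly the harmonic number $H_K$, which exceeds $\ln K$ by a little over one half, so "roughly $\ln K$" is the right order of magnitude.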


Exercise 50.2.[2] Derive the ideal soliton distribution. At the first iteration
(t = 0) let the number of packets of degree d be $h_0(d)$; show that (for
d > 1) the expected number of packets of degree d that have their degree
reduced to d - 1 is $h_0(d) d/K$; and at the tth iteration, when t of the
K packets have been recovered and the number of packets of degree d
is $h_t(d)$, the expected number of packets of degree d that have their
degree reduced to d - 1 is $h_t(d) d/(K - t)$. Hence show that in order
to have the expected number of packets of degree 1 satisfy $h_t(1) = 1$
for all $t \in \{0, \ldots, K - 1\}$, we must, to start with, have $h_0(1) = 1$ and
$h_0(2) = K/2$; and more generally, $h_t(2) = (K - t)/2$; then by recursion
solve for $h_0(d)$ for d = 3 upwards.


This degree distribution works poorly in practice, because fluctuations 
around the expected behaviour make it very likely that at some point in the 
decoding process there will be no degree-one check nodes; and, furthermore, a 
few source nodes will receive no connections at all. A small modification fixes 
these problems. 

The robust soliton distribution has two extra parameters, c and $\delta$; it is
designed to ensure that the expected number of degree-one checks is about

$$ S \equiv c \ln(K/\delta) \sqrt{K}, \qquad (50.3) $$

rather than 1, throughout the decoding process. The parameter $\delta$ is a bound
on the probability that the decoding fails to run to completion after a certain
number $K'$ of packets have been received. The parameter c is a constant of
order 1, if our aim is to prove Luby’s main theorem about LT codes; in practice
however it can be viewed as a free parameter, with a value somewhat smaller
than 1 giving good results. We define a positive function


$$ \tau(d) = \begin{cases}
  \frac{S}{K} \frac{1}{d} & \text{for } d = 1, 2, \ldots, (K/S) - 1 \\
  \frac{S}{K} \ln(S/\delta) & \text{for } d = K/S \\
  0 & \text{for } d > K/S
\end{cases} \qquad (50.4) $$

(see figure 50.2 and exercise 50.4 (p.594)) then add the ideal soliton
distribution $\rho$ to $\tau$ and normalize to obtain the robust soliton distribution, $\mu$:

$$ \mu(d) = \frac{\rho(d) + \tau(d)}{Z}, \qquad (50.5) $$

where $Z = \sum_d \rho(d) + \tau(d)$. The number of encoded packets required at the
receiving end to ensure that the decoding can run to completion, with
probability at least $1 - \delta$, is $K' = KZ$.

Figure 50.2. The distributions $\rho(d)$ and $\tau(d)$ for the case K = 10000,
c = 0.2, $\delta$ = 0.05, which gives S = 244, K/S = 41, and $Z \simeq 1.3$. The
distribution $\tau$ is largest at d = 1 and d = K/S.

Figure 50.3. The number of degree-one checks S (upper figure) and the
quantity $K'$ (lower figure) as a function of the two parameters c and $\delta$, for
K = 10000. Luby’s main theorem proves that there exists a value of c such
that, given $K'$ received packets, the decoding algorithm will recover the K
source packets with probability $1 - \delta$.

Luby’s (2002) analysis explains how the small-d end of $\tau$ has the role of
ensuring that the decoding process gets started, and the spike in $\tau$ at d = K/S
is included to ensure that every source packet is likely to be connected to a
check at least once. Luby’s key result is that (for an appropriate value of the
constant c) receiving $K' = K + 2 \ln(S/\delta) S$ checks ensures that all packets can
be recovered with probability at least $1 - \delta$. In the illustrative figures I have
set the allowable decoder failure probability $\delta$ quite large, because the actual
failure probability is much smaller than is suggested by Luby’s conservative
analysis.
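
The construction in equations (50.3) to (50.5) can be sketched numerically as follows. The function name `robust_soliton` is hypothetical, and rounding K/S to the nearest integer to locate the spike is an assumption of this sketch (the text treats K/S as if it were an integer).

```python
import math

def robust_soliton(K, c, delta):
    """Return (S, spike, Z, mu) with mu[d] the robust soliton probability of degree d.

    Lists are indexed by degree d = 0..K (entry 0 is unused and zero).
    Assumes 1 <= round(K/S) <= K.
    """
    S = c * math.log(K / delta) * math.sqrt(K)               # equation (50.3)
    spike = int(round(K / S))                                 # assumed integer version of K/S
    rho = [0.0, 1.0 / K] + [1.0 / (d * (d - 1)) for d in range(2, K + 1)]
    tau = [0.0] * (K + 1)
    for d in range(1, spike):
        tau[d] = S / (K * d)                                  # small-d part of (50.4)
    tau[spike] = S * math.log(S / delta) / K                  # spike at d = K/S
    Z = sum(rho) + sum(tau)                                   # normalizer in (50.5)
    mu = [(r + t) / Z for r, t in zip(rho, tau)]
    return S, spike, Z, mu

S, spike, Z, mu = robust_soliton(K=10000, c=0.2, delta=0.05)
print(round(S), spike, round(Z, 2))
```

With the parameters quoted in the caption of figure 50.2, this sketch gives S close to 244, a spike at d = 41 and Z close to 1.3, so that $K' = KZ$ is roughly 13000 packets under Luby's conservative bound.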
In practice, LT codes can be tuned so that a file of original size $K \simeq 10000$
packets is recovered with an overhead of about 5%. Figure 50.4 shows
histograms of the actual number of packets required for a couple of settings of
the parameters, achieving mean overheads smaller than 5% and 10% respectively.

Figure 50.4. Histograms of the actual number of packets N required in order
to recover a file of size K = 10000 packets. The parameters were as follows:
top histogram: c = 0.01, $\delta$ = 0.5 (S = 10, K/S = 1010, and $Z \simeq 1.01$);
middle: c = 0.03, $\delta$ = 0.5 (S = 30, K/S = 337, and $Z \simeq 1.03$);
bottom: c = 0.1, $\delta$ = 0.5 (S = 99, K/S = 101, and $Z \simeq 1.1$).


> 50.4 Applications

Digital fountain codes are an excellent solution in a wide variety of situations.
Let’s mention two.


Storage

You wish to make a backup of a large file, but you are aware that your magnetic
tapes and hard drives are all unreliable in the sense that catastrophic failures,
in which some stored packets are permanently lost within one device, occur at
a rate of something like $10^{-3}$ per day. How should you store your file?

A digital fountain can be used to spray encoded packets all over the place,
on every storage device available. Then to recover the backup file, whose size
was K packets, one simply needs to find $K' \simeq K$ packets from anywhere.
Corrupted packets do not matter; we simply skip over them and find more
packets elsewhere.

This method of storage also has advantages in terms of speed of file
recovery. In a hard drive, it is standard practice to store a file in successive
sectors of a hard drive, to allow rapid reading of the file; but if, as
occasionally happens, a packet is lost (owing to the reading head being off track for
a moment, giving a burst of errors that cannot be corrected by the packet’s
error-correcting code), a whole revolution of the drive must be performed to
bring back the packet to the head for a second read. The time taken for one
revolution produces an undesirable delay in the file system.

If files were instead stored using the digital fountain principle, with the
digital drops stored in one or more consecutive sectors on the drive, then one
would never need to endure the delay of re-reading a packet; packet loss would
become less important, and the hard drive could consequently be operated
faster, with higher noise level, and with fewer resources devoted to
noisy-channel coding.














> Exercise 50.3.[2] Compare the digital fountain method of robust storage on
multiple hard drives with RAID (the redundant array of independent 
disks). 


Broadcast 


Imagine that ten thousand subscribers in an area wish to receive a digital
movie from a broadcaster. The broadcaster can send the movie in packets
over a broadcast network (for example, by a wide-bandwidth phone line, or
by satellite).

Imagine that not all packets are received at all the houses. Let’s say
f = 0.1% of them are lost at each house. In a standard approach in which the
file is transmitted as a plain sequence of packets with no encoding, each house
would have to notify the broadcaster of the fK missing packets, and request
that they be retransmitted. And with ten thousand subscribers all requesting
such retransmissions, there would be a retransmission request for almost every
packet. Thus the broadcaster would have to repeat the entire broadcast twice
in order to ensure that most subscribers have received the whole movie, and
most users would have to wait roughly twice as long as the ideal time before
the download was complete.
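
The claim that almost every packet would be requested again can be checked with a line of arithmetic, assuming (as an idealization not stated in the text) that packet losses are independent across houses. The function name below is hypothetical.

```python
def prob_packet_requested(f, subscribers):
    """P(at least one subscriber lost a given packet), assuming independent losses."""
    return 1.0 - (1.0 - f) ** subscribers

# f = 0.1% loss per house, ten thousand subscribers
p = prob_packet_requested(f=0.001, subscribers=10_000)
print(p)
```

Under this assumption each packet is missed by at least one house with probability $1 - (1 - f)^{10000}$, which is within about $10^{-4}$ of one.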

If the broadcaster uses a digital fountain to encode the movie, each sub- 
scriber can recover the movie from any K’ ~ K packets. So the broadcast 
needs to last for only, say, 1.1K packets, and every house is very likely to have 
successfully recovered the whole file. 

Another application is broadcasting data to cars. Imagine that we want to 
send updates to in-car navigation databases by satellite. There are hundreds 
of thousands of vehicles, and they can receive data only when they are out 
on the open road; there are no feedback channels. A standard method for 
sending the data is to put it in a carousel, broadcasting the packets in a fixed 
periodic sequence. ‘Yes, a car may go through a tunnel, and miss out on a 
few hundred packets, but it will be able to collect those missed packets an 
hour later when the carousel has gone through a full revolution (we hope); or 
maybe the following day...’ 

If instead the satellite uses a digital fountain, each car needs to receive 
only an amount of data equal to the original file size (plus 5%). 


Further reading 


The encoders and decoders sold by Digital Fountain have even higher efficiency
than the LT codes described here, and they work well for all blocklengths, not
only large lengths such as $K \gtrsim 10000$. Shokrollahi (2003) presents raptor
codes, which are an extension of LT codes with linear-time encoding and
decoding.


> 50.5 Further exercises 


> Exercise 50.4.1] Understanding the robust soliton distribution. 


Repeat the analysis of exercise 50.2 (p.592) but now aim to have the
expected number of packets of degree 1 be $h_t(1) = 1 + S$ for all t, instead
of 1. Show that the initial required number of packets is

$$ h_0(d) = \frac{K}{d(d-1)} + \frac{S}{d} \quad \text{for } d > 1. \qquad (50.6) $$

The reason for truncating the second term beyond d = K/S and
replacing it by the spike at d = K/S (see equation (50.4)) is to ensure that
the decoding complexity does not grow larger than O(K ln K).


Estimate the expected number of packets $\sum_d h_0(d)$ and the expected
number of edges in the sparse graph $\sum_d h_0(d) d$ (which determines the
decoding complexity) if the histogram of packets is as given in (50.6).
Compare with the expected numbers of packets and edges when the
robust soliton distribution (50.4) is used.


Exercise 50.5.[4] Show that the spike at d = K/S (equation (50.4)) is an
adequate replacement for the tail of high-weight packets in (50.6).


Exercise 50.6.[4C] Investigate experimentally how necessary the spike at d =
K/S (equation (50.4)) is for successful decoding. Investigate also whether 
the tail of p(d) beyond d = K/S is necessary. What happens if all high- 
weight degrees are removed, both the spike at d = K/S and the tail of 
p(d) beyond d = K/S? 


Exercise 50.7.[4] Fill in the details in the proof of Luby’s main theorem, that
receiving $K' = K + 2 \ln(S/\delta) S$ checks ensures that all the source packets
can be recovered with probability at least $1 - \delta$.


Exercise 50.8.[4C] Optimize the degree distribution of a digital fountain code
for a file of K = 10000 packets. Pick a sensible objective function for 
your optimization, such as minimizing the mean of N, the number of 
packets required for complete decoding, or the 95th percentile of the 
histogram of N (figure 50.4). 


> Exercise 50.9.[3] Make a model of the situation where a data stream is
broadcast to cars, and quantify the advantage that the digital fountain has
over the carousel method. 


Exercise 50.10.[2] Construct a simple example to illustrate the fact that the
digital fountain decoder of section 50.2 is suboptimal — it sometimes 
gives up even though the information available is sufficient to decode 
the whole file. How does the cost of the optimal decoder compare? 


> Exercise 50.11.[2] If every transmitted packet were created by adding together
source packets at random with probability 1/2 of each source packet’s
being included, show that the probability that $K' = K$ received packets
suffice for the optimal decoder to be able to recover the K source packets
is about 0.29 for large K. [To put it another way, what is the probability
that a random binary K × K matrix has full rank?]


Show that if $K' = K + \Delta$ packets are received, the probability that they
will not suffice for the optimal decoder is roughly $2^{-\Delta}$.


> Exercise 50.12.[4] Implement an optimal digital fountain decoder that uses
the method of Richardson and Urbanke (2001b) derived for fast encod- 
ing of sparse-graph codes (section 47.7) to handle the matrix inversion 
required for optimal decoding. Now that you have changed the decoder, 
you can reoptimize the degree distribution, using higher-weight packets. 
By how much can you reduce the overhead? Confirm the assertion that 
this approach makes digital fountain codes viable as erasure-correcting 
codes for all blocklengths, not just the large blocklengths for which LT 
codes are excellent. 


> Exercise 50.13.[5] Digital fountain codes are excellent rateless codes for erasure
channels. Make a rateless code for a channel that has both erasures and 
noise. 
> 50.6 Summary of sparse-graph codes


A simple method for designing error-correcting codes for noisy channels, first 
pioneered by Gallager (1962), has recently been rediscovered and generalized, 
and communication theory has been transformed. The practical performance 
of Gallager’s low-density parity-check codes and their modern cousins is vastly 
better than the performance of the codes with which textbooks have been filled 
in the intervening years. 

Which sparse-graph code is ‘best’ for a noisy channel depends on the cho- 
sen rate and blocklength, the permitted encoding and decoding complexity, 
and the question of whether occasional undetected errors are acceptable. Low- 
density parity-check codes are the most versatile; it’s easy to make a compet- 
itive low-density parity-check code with almost any rate and blocklength, and 
low-density parity-check codes virtually never make undetected errors. 

For the special case of the erasure channel, the sparse-graph codes that are 
best are digital fountain codes. 


> 50.7 Conclusion 


The best solution to the communication problem is: 


Combine a simple, pseudo-random code with a message-passing decoder.


A


Notation 


What does P(A| B,C) mean? P(A| B,C) is pronounced ‘the probability 
that A is true given that B is true and C is true’. Or, more briefly, ‘the 
probability of A given B and C’. (See Chapter 2, p.22.) 


What do log and ln mean? In this book, $\log x$ means the base-two
logarithm, $\log_2 x$; $\ln x$ means the natural logarithm, $\log_e x$.


What does $\hat{s}$ mean? Usually, a ‘hat’ over a variable denotes a guess or
estimator. So $\hat{s}$ is a guess at the value of s.


Integrals. There is no difference between $\int f(u) \, du$ and $\int du \, f(u)$. The
integrand is f(u) in both cases.


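What does $\prod_{n=1}^{N}$ mean? This is like the summation $\sum_{n=1}^{N}$ but it denotes a
product. It’s pronounced ‘product over n from 1 to N’. So, for example,

$$ \prod_{n=1}^{N} n = 1 \times 2 \times 3 \times \cdots \times N = N! = \exp\left[ \sum_{n=1}^{N} \ln n \right]. \qquad (A.1) $$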


I like to choose the name of the free variable in a sum or a product — 
here, n — to be the lower case version of the range of the sum. So n 
usually runs from 1 to N, and m usually runs from 1 to M. This is a 
habit I learnt from Yaser Abu-Mostafa, and I think it makes formulae 
easier to understand. 


What does $\binom{N}{n}$ mean? This is pronounced ‘N choose n’, and it is the
number of ways of selecting an unordered set of n objects from a set of
size N.

$$ \binom{N}{n} = \frac{N!}{(N-n)! \, n!}. \qquad (A.2) $$

This function is known as the combination function.


What is $\Gamma(x)$? The gamma function is defined by $\Gamma(x) \equiv \int_0^\infty du \, u^{x-1} e^{-u}$,
for x > 0. The gamma function is an extension of the factorial function
to real number arguments. In general, $\Gamma(x + 1) = x \Gamma(x)$, and for integer
arguments, $\Gamma(x + 1) = x!$. The digamma function is defined by
$\Psi(x) \equiv \frac{d}{dx} \ln \Gamma(x)$.


For large x (for practical purposes, $0.1 \le x \le \infty$),

$$ \ln \Gamma(x) \simeq \left( x - \tfrac{1}{2} \right) \ln(x) - x + \tfrac{1}{2} \ln 2\pi + O(1/x); \qquad (A.3) $$

and for small x (for practical purposes, $0 \le x \le 0.5$):

$$ \ln \Gamma(x) \simeq \ln \frac{1}{x} - \gamma_e x + O(x^2) \qquad (A.4) $$

where $\gamma_e$ is Euler’s constant.
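
The two approximations can be compared with a library implementation of $\ln \Gamma(x)$. The sketch below uses Python's `math.lgamma`; the helper names and the test points x = 10 and x = 0.1 are choices made here for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329   # Euler's constant, gamma_e

def lngamma_large(x):
    """Leading terms of the large-x approximation (A.3)."""
    return (x - 0.5) * math.log(x) - x + 0.5 * math.log(2.0 * math.pi)

def lngamma_small(x):
    """Leading terms of the small-x approximation (A.4)."""
    return math.log(1.0 / x) - EULER_GAMMA * x

err_large = math.lgamma(10.0) - lngamma_large(10.0)   # remainder of order 1/x
err_small = math.lgamma(0.1) - lngamma_small(0.1)     # remainder of order x^2
print(err_large, err_small)
```

Both remainders are below 0.01 at these test points, in line with the stated O(1/x) and O(x²) error terms.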
What does $H_2^{-1}(1 - R/C)$ mean? Just as $\sin^{-1}(s)$ denotes the inverse
function to s = sin(x), so $H_2^{-1}(h)$ is the inverse function to $h = H_2(x)$.

There is potential confusion when people use $\sin^2 x$ to denote $(\sin x)^2$,
since then we might expect $\sin^{-1} s$ to denote 1/sin(s); I therefore like
to avoid using the notation $\sin^2 x$.


What does f'(x) mean? The answer depends on the context. Often, a
‘prime’ is used to denote differentiation:

$$ f'(x) \equiv \frac{d}{dx} f(x); \qquad (A.5) $$

similarly, a dot denotes differentiation with respect to time, t:

$$ \dot{x} \equiv \frac{d}{dt} x. \qquad (A.6) $$

However, the prime is also a useful indicator for ‘another variable’, for 
example ‘a new value for a variable’. So, for example, x’ might denote 
‘the new value of x’. Also, if there are two integers that both range from 
1 to N, I will often name those integers n and n’. 


So my rule is: if a prime occurs in an expression that could be a func- 
tion, such as f'(x) or h’(y), then it denotes differentiation; otherwise it 
indicates ‘another variable’. 


What is the error function? Definitions of this function vary. I define it
to be the cumulative probability of a standard (variance = 1) normal
distribution,

$$ \Phi(z) \equiv \int_{-\infty}^{z} \exp(-z^2/2) / \sqrt{2\pi} \; dz. \qquad (A.7) $$

What does $\mathcal{E}[r]$ mean? $\mathcal{E}[r]$ is pronounced ‘the expected value of r’ or ‘the
expectation of r’, and it is the mean value of r. Another symbol for
‘expected value’ is the pair of angle-brackets, $\langle r \rangle$.


What does |x| mean? The vertical bars ‘| · |’ have two meanings. If A is a
set, then |A| denotes the number of elements in the set; if x is a number, 
then |x| is the absolute value of x. 


What does [A|P] mean? Here, A and P are matrices with the same num- 
ber of rows. [A|P] denotes the double-width matrix obtained by putting 
A alongside P. The vertical bar is used to avoid confusion with the 
product AP. 


What does $x^{\mathsf{T}}$ mean? The superscript T is pronounced ‘transpose’.
Transposing a row-vector turns it into a column vector:

$$ (1, 2, 3)^{\mathsf{T}} = \begin{pmatrix} 1 \\ 2 \\ 3 \end{pmatrix}, \qquad (A.8) $$

and vice versa. [Normally my vectors, indicated by bold face type (x),
are column vectors.]


Similarly, matrices can be transposed. If $M_{ij}$ is the entry in row i and
column j of matrix M, and $N = M^{\mathsf{T}}$, then $N_{ji} = M_{ij}$.


What are Trace M and det M? The trace of a matrix is the sum of its
diagonal elements,

$$ \mathrm{Trace}\, M = \sum_i M_{ii}. \qquad (A.9) $$

The determinant of M is denoted det M.


What does $\delta_{mn}$ mean? The $\delta$ matrix is the identity matrix.

$$ \delta_{mn} = \begin{cases} 1 & \text{if } m = n \\ 0 & \text{if } m \neq n. \end{cases} $$

Another name for the identity matrix is I or 1. Sometimes I include a
subscript on this symbol, as in $\mathbf{1}_K$, which indicates the size of the matrix
(K × K).


What does $\delta(x)$ mean? The delta function has the property

$$ \int dx \, f(x) \delta(x) = f(0). \qquad (A.10) $$

Another possible meaning for $\delta(S)$ is the truth function, which is 1 if the
proposition S is true, but I have adopted another notation for that. After
all, the symbol $\delta$ is quite busy already, with the two roles mentioned above
in addition to its role as a small real number $\delta$ and an increment operator
(as in $\delta x$)!


What does 1[S] mean? 1[S] is the truth function, which is 1 if the
proposition S is true and 0 otherwise. For example, the number of positive
numbers in the set T = {-2, 1, 3} can be written

$$ \sum_{x \in T} \mathbb{1}[x > 0]. \qquad (A.11) $$


What is the difference between ‘:=’ and ‘=’? In an algorithm, x := y
means that the variable x is updated by assigning it the value of y.


In contrast, x = y is a proposition, a statement that x is equal to y. 


See Chapters 23 and 29 for further definitions and notation relating to 
probability distributions. 


B


Some Physics 


> B.1 About phase transitions 


A system with states x in contact with a heat bath at temperature $T = 1/\beta$
has probability distribution

$$ P(x \,|\, \beta) = \frac{1}{Z(\beta)} \exp(-\beta E(x)). \qquad (B.1) $$

The partition function is

$$ Z(\beta) = \sum_x \exp(-\beta E(x)). \qquad (B.2) $$

The inverse temperature $\beta$ can be interpreted as defining an exchange rate
between entropy and energy. $(1/\beta)$ is the amount of energy that must be
given to a heat bath to increase its entropy by one nat.

Often, the system will be affected by some other parameters such as the
volume of the box it is in, V, in which case Z is a function of V too, $Z(\beta, V)$.

For any system with a finite number of states, the function $Z(\beta)$ is
evidently a continuous function of $\beta$, since it is simply a sum of exponentials.
Moreover, all the derivatives of $Z(\beta)$ with respect to $\beta$ are continuous too.

What phase transitions are all about, however, is this: phase transitions
correspond to values of $\beta$ and V (called critical points) at which the derivatives
of Z have discontinuities or divergences.

Immediately we can deduce: 





Only systems with an infinite number of states can show phase transitions.





Often, we include a parameter N describing the size of the system. Phase
transitions may appear in the limit $N \to \infty$. Real systems may have a value
of N like $10^{23}$.

If we make the system large by simply grouping together N independent
systems whose partition function is $Z_{(1)}(\beta)$, then nothing interesting happens.
The partition function for N independent identical systems is simply

$$ Z_{(N)}(\beta) = [Z_{(1)}(\beta)]^N. \qquad (B.3) $$

Now, while this function $Z_{(N)}(\beta)$ may be a very rapidly varying function of $\beta$,
that doesn’t mean it is showing phase transitions. The natural way to look at
the partition function is in the logarithm

$$ \ln Z_{(N)}(\beta) = N \ln Z_{(1)}(\beta). \qquad (B.4) $$

Duplicating the original system N times simply scales up all properties like
the energy and heat capacity of the system by a factor of N. So if the original
system showed no phase transitions then the scaled up system won’t have any
either.


Only systems with long-range correlations show phase transitions. 


Long-range correlations do not require long-range energetic couplings; for 
example, a magnet has only short-range couplings (between adjacent spins) 
but these are sufficient to create long-range order. 


Why are points at which derivatives diverge interesting? 


The derivatives of $\ln Z$ describe properties like the heat capacity of the
system (that’s the second derivative) or its fluctuations in energy. If the second
derivative of $\ln Z$ diverges at a temperature $1/\beta$, then the heat capacity of the
system diverges there, which means it can absorb or release energy without
changing temperature (think of ice melting in ice water); when the system is
at equilibrium at that temperature, its energy fluctuates a lot, in contrast to
the normal law-of-large-numbers behaviour, where the energy only varies by
one part in $\sqrt{N}$.


A toy system that shows a phase transition 


Imagine a collection of N coupled spins that have the following energy as a
function of their state $x \in \{0, 1\}^N$.

$$ E(x) = \begin{cases} -N\epsilon & x = (0, 0, 0, \ldots, 0) \\ 0 & \text{otherwise.} \end{cases} \qquad (B.5) $$

This energy function describes a ground state in which all the spins are aligned
in the zero direction; the energy per spin in this state is $-\epsilon$. If any spin
changes state then the energy is zero. This model is like an extreme version
of a magnetic interaction, which encourages pairs of spins to be aligned.

We can contrast it with an ordinary system of N independent spins whose
energy is:

$$ E(x) = \epsilon \sum_n (2 x_n - 1). \qquad (B.6) $$

Like the first system, the system of independent spins has a single ground
state (0, 0, 0, ..., 0) with energy $-N\epsilon$, and it has roughly $2^N$ states with energy
very close to 0, so the low-temperature and high-temperature properties of the
independent-spin system and the coupled-spin system are virtually identical.

The partition function of the coupled-spin system is

$$ Z(\beta) = e^{\beta N \epsilon} + 2^N - 1. \qquad (B.7) $$

The function

$$ \ln Z(\beta) = \ln \left( e^{\beta N \epsilon} + 2^N - 1 \right) \qquad (B.8) $$

is sketched in figure B.1a along with its low temperature behaviour,

$$ \ln Z(\beta) \simeq N \beta \epsilon, \quad \beta \to \infty, \qquad (B.9) $$

and its high temperature behaviour,

$$ \ln Z(\beta) \simeq N \ln 2, \quad \beta \to 0. \qquad (B.10) $$

Figure B.1. (a) Partition function of toy system which shows a phase
transition for large N. The arrow marks the point $\beta_c = \log 2/\epsilon$. (b) The same,
for larger N. (c) The variance of the energy of the system as a function of $\beta$
for two system sizes. As N increases the variance has an increasingly sharp
peak at the critical point $\beta_c$. Contrast with figure B.2. [Panels (a) and (b)
plot log Z together with the asymptotes N beta epsilon and N log(2) against
beta.]

Figure B.2. The partition function (a) and energy-variance (b) of a system
consisting of N independent spins. The partition function changes gradually
from one asymptote to the other, regardless of how large N is; the variance of
the energy does not have a peak. The fluctuations are largest at high
temperature (small $\beta$) and scale linearly with system size N.

The arrow marks the point

$$ \beta_c = \frac{\ln 2}{\epsilon} \qquad (B.11) $$

at which these two asymptotes intersect. In the limit $N \to \infty$, the graph of
$\ln Z(\beta)$ becomes more and more sharply bent at this point (figure B.1b).

The second derivative of $\ln Z$, which describes the variance of the energy
of the system, has a peak value, at $\beta = \ln 2/\epsilon$, roughly equal to

$$ \frac{N^2 \epsilon^2}{4}, \qquad (B.12) $$

which corresponds to the system spending half of its time in the ground state
and half its time in the other states.

At this critical point, the heat capacity of this system is thus proportional
to $N^2$; the heat capacity per spin is proportional to N, which, for infinite N, is
infinite, in contrast to the behaviour of systems away from phase transitions,
whose capacity per atom is a finite number.

For comparison, figure B.2 shows the partition function and energy-variance
of the ordinary independent-spin system.
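
The peak value of the energy variance can be checked numerically for the toy coupled-spin system. This is a sketch under assumptions: the function names are hypothetical, $\epsilon$ is set to 1 by default, and the variance is computed directly from the two energy levels (the ground state at $-N\epsilon$ and the $2^N - 1$ states at zero) rather than by differentiating $\ln Z$.

```python
import math

def ground_state_probability(beta, N, eps=1.0):
    """P(ground state) = e^{beta N eps} / (e^{beta N eps} + 2^N - 1), overflow-safe form."""
    return 1.0 / (1.0 + (2.0 ** N - 1.0) * math.exp(-beta * N * eps))

def energy_variance(beta, N, eps=1.0):
    """Variance of an energy that is -N*eps with probability p0 and 0 otherwise."""
    p0 = ground_state_probability(beta, N, eps)
    return (N * eps) ** 2 * p0 * (1.0 - p0)

N, eps = 20, 1.0
beta_c = math.log(2.0) / eps
peak = energy_variance(beta_c, N, eps)
print(peak, N ** 2 * eps ** 2 / 4.0)
```

At $\beta_c$ the ground state has probability very close to one half, so the variance is close to $N^2 \epsilon^2 / 4$; away from $\beta_c$ (for example at $2\beta_c$) it is smaller by many orders of magnitude.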


More generally 


Phase transitions can be categorized into ‘first-order’ and ‘continuous’
transitions. In a first-order phase transition, there is a discontinuous change of one
or more order-parameters; in a continuous transition, all order-parameters
change continuously. [What’s an order-parameter? A scalar function of the
state of the system; or, to be precise, the expectation of such a function.]

In the vicinity of a critical point, the concept of ‘typicality’ defined in
Chapter 4 does not hold. For example, our toy system, at its critical point,
has a 50% chance of being in a state with energy $-N\epsilon$, and roughly a $1/2^{N+1}$
chance of being in each of the other states that have energy zero. It is thus not
the case that $\ln 1/P(x)$ is very likely to be close to the entropy of the system
at this point, unlike a system with N i.i.d. components.

Remember that information content ($\ln 1/P(x)$) and energy are very closely
related. If typicality holds, then the system’s energy has negligible
fluctuations, and vice versa.


C


Some Mathematics 


> C.1 Finite field theory 


Most linear codes are expressed in the language of Galois theory 


Why are Galois fields an appropriate language for linear codes? First, a defi- 
nition and some examples. 


A field $F$ is a set $F = \{0, F'\}$ such that

1. $F$ forms an Abelian group under an addition operation ‘+’, with
   0 being the identity; [Abelian means all elements commute, i.e.,
   satisfy $a + b = b + a$.]

2. $F'$ forms an Abelian group under a multiplication operation ‘·’;
   multiplication of any element by 0 yields 0;

3. these operations satisfy the distributive rule $(a + b) \cdot c = a \cdot c + b \cdot c$.

For example, the real numbers form a field, with ‘+’ and ‘·’ denoting
ordinary addition and multiplication.

A Galois field GF(q) is a field with a finite number of elements q.
A unique Galois field exists for any $q = p^m$, where p is a prime number
and m is a positive integer; there are no other finite fields.

GF(2). The addition and multiplication tables for GF(2) are shown in
table C.1. These are the rules of addition and multiplication modulo 2.

    + | 0 1        · | 0 1
    --+-----       --+-----
    0 | 0 1        0 | 0 0
    1 | 1 0        1 | 0 1

Table C.1. Addition and multiplication tables for GF(2).

GF(p). For any prime number p, the addition and multiplication rules are
those for ordinary addition and multiplication, modulo p.

GF(4). The rules for GF($p^m$), with m > 1, are not those of ordinary addition
and multiplication. For example the tables for GF(4) (table C.2) are not
the rules of addition and multiplication modulo 4. Notice that 1 + 1 = 0,
for example. So how can GF(4) be described? It turns out that the
elements can be related to polynomials. Consider polynomial functions
of x of degree 1 and with coefficients that are elements of GF(2). The
polynomials shown in table C.3 obey the addition and multiplication
rules of GF(4) if addition and multiplication are modulo the polynomial
$x^2 + x + 1$, and the coefficients of the polynomials are from GF(2). For
example, $B \cdot B = x^2 + (1 + 1)x + 1 = x = A$. Each element may also be
represented as a bit pattern as shown in table C.3, with addition being
bitwise modulo 2, and multiplication defined with an appropriate carry
operation.

    + | 0 1 A B        · | 0 1 A B
    --+---------       --+---------
    0 | 0 1 A B        0 | 0 0 0 0
    1 | 1 0 B A        1 | 0 1 A B
    A | A B 0 1        A | 0 A B 1
    B | B A 1 0        B | 0 B 1 A

Table C.2. Addition and multiplication tables for GF(4).

    Element   Polynomial   Bit pattern
    0         0            00
    1         1            01
    A         x            10
    B         x + 1        11

Table C.3. Representations of the elements of GF(4).
GF(8). We can denote the elements of GF(8) by {0, 1, A, B, C, D, E, F}. Each
element can be mapped onto a polynomial over GF(2). The multiplication
and addition operations are given by multiplication and addition of
the polynomials, modulo $x^3 + x + 1$. The multiplication table is given
below.

    element   polynomial      binary representation
    0         0               000
    1         1               001
    A         x               010
    B         x + 1           011
    C         x^2             100
    D         x^2 + 1         101
    E         x^2 + x         110
    F         x^2 + x + 1     111

    · | 0 1 A B C D E F
    --+-----------------
    0 | 0 0 0 0 0 0 0 0
    1 | 0 1 A B C D E F
    A | 0 A C E B 1 F D
    B | 0 B E D F C 1 A
    C | 0 C B F E A D 1
    D | 0 D 1 C A F B E
    E | 0 E F 1 D B A C
    F | 0 F D A 1 E C B
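The polynomial description of GF(4) and GF(8) can be checked mechanically. The following is a minimal Python sketch, not taken from the book: it assumes each element is stored as a bit pattern (bit k holding the coefficient of $x^k$), so that addition is bitwise XOR, and the helper names `gf_mul` and `gf_table` are hypothetical. It multiplies polynomials over GF(2) and reduces them modulo $x^2 + x + 1$ or $x^3 + x + 1$.

```python
def gf_mul(a: int, b: int, modulus: int, m: int) -> int:
    """Multiply two elements of GF(2^m), each given as an m-bit pattern.

    Bit k of a pattern is the coefficient of x^k; `modulus` is the bit
    pattern of the reducing polynomial (0b111 stands for x^2 + x + 1).
    """
    product = 0
    while b:
        if b & 1:             # lowest remaining coefficient of b is 1
            product ^= a      # add (XOR) the current shifted copy of a
        b >>= 1
        a <<= 1               # multiply a by x
        if a & (1 << m):      # degree reached m: reduce modulo the polynomial
            a ^= modulus
    return product


def gf_table(m: int, modulus: int) -> list[str]:
    """Rows of the multiplication table, using the element labels of the text."""
    labels = "01ABCDEF"
    q = 1 << m
    return ["".join(labels[gf_mul(a, b, modulus, m)] for b in range(q))
            for a in range(q)]


GF4_ROWS = gf_table(2, 0b111)     # modulo x^2 + x + 1
GF8_ROWS = gf_table(3, 0b1011)    # modulo x^3 + x + 1

for name, rows in (("GF(4)", GF4_ROWS), ("GF(8)", GF8_ROWS)):
    print(name)
    for row in rows:
        print("  " + " ".join(row))
```

With this encoding the ‘appropriate carry operation’ mentioned for bit patterns is the conditional XOR with the modulus, and the printed rows can be compared entry by entry with the multiplication tables in this section.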





Why are Galois fields relevant to linear codes? Imagine generalizing a binary 
generator matrix G and binary vector s to a matrix and vector with elements 
from a larger set, and generalizing the addition and multiplication operations 
that define the product Gs. In order to produce an appropriate input for 
a symmetric channel, it would be convenient if, for random s, the product 
Gs produced all elements in the enlarged set with equal probability. This 
uniform distribution is easiest to guarantee if these elements form a group 
under both addition and multiplication, because then these operations do not 
break the symmetry among the elements. When two random elements of a 
multiplicative group are multiplied together, all elements are produced with 
equal probability. This is not true of other sets such as the integers, for which 
the multiplication operation is more likely to give rise to some elements (the 
composite numbers) than others. Galois fields, by their definition, avoid such 
symmetry-breaking effects. 
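A small numerical illustration of this symmetry argument, under assumptions of my own: the sketch below compares the products of two independent, uniformly chosen non-zero elements of GF(4) with the products of two non-zero integers modulo 4. The integers modulo 4 are used here only as a convenient finite stand-in for the integers mentioned above, and the function name `gf4_mul` is hypothetical.

```python
from collections import Counter


def gf4_mul(a: int, b: int) -> int:
    """Multiply in GF(4): polynomials over GF(2) modulo x^2 + x + 1."""
    product = 0
    for _ in range(2):        # b has at most two coefficients
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0b100:         # reduce when the degree reaches 2
            a ^= 0b111
    return product


nonzero = [1, 2, 3]           # bit patterns of 1, A and B
gf4_counts = Counter(gf4_mul(a, b) for a in nonzero for b in nonzero)
mod4_counts = Counter((a * b) % 4 for a in nonzero for b in nonzero)

print("GF(4) products:   ", dict(gf4_counts))
print("modulo-4 products:", dict(mod4_counts))
```

In GF(4) each non-zero element is produced by exactly three of the nine pairs, whereas modulo 4 the value 2 is produced four times and 0 appears even though neither factor is zero.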


> C.2 Eigenvectors and eigenvalues


A right-eigenvector of a square matrix $\mathbf{A}$ is a non-zero vector $\mathbf{e}_R$ that satisfies

$$\mathbf{A}\mathbf{e}_R = \lambda \mathbf{e}_R, \tag{C.1}$$

where $\lambda$ is the eigenvalue associated with that eigenvector. The eigenvalue
may be a real number or complex number and it may be zero. Eigenvectors
may be real or complex.

A left-eigenvector of a matrix $\mathbf{A}$ is a vector $\mathbf{e}_L$ that satisfies

$$\mathbf{e}_L^{\mathsf{T}}\mathbf{A} = \lambda \mathbf{e}_L^{\mathsf{T}}. \tag{C.2}$$

The following statements for right-eigenvectors also apply to left-eigenvectors.


• If a matrix has two or more linearly independent right-eigenvectors with
  the same eigenvalue then that eigenvalue is called a degenerate eigenvalue
  of the matrix, or a repeated eigenvalue. Any linear combination of those
  eigenvectors is another right-eigenvector with the same eigenvalue.

• The principal right-eigenvector of a matrix is, by definition, the
  right-eigenvector with the largest associated eigenvalue.

• If a real matrix has a right-eigenvector with complex eigenvalue
  $\lambda = x + yi$ then it also has a right-eigenvector with the conjugate
  eigenvalue $\lambda^* = x - yi$.
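As a concrete check of definitions (C.1) and (C.2), here is a short sketch in plain Python. The matrix is chosen for illustration and appears to be the first matrix of table C.4; the closed-form eigenvalue $1 + \sqrt{2}$ and the unnormalized eigenvectors $(\sqrt{2}, 1, 0)$ and $(1, \sqrt{2}, 0)$ are worked out by hand for this particular matrix, and the helper names are hypothetical.

```python
import math

A = [[1.0, 2.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]


def mat_vec(M, v):
    """Column-vector product M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def vec_mat(v, M):
    """Row-vector product v^T M."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


lam = 1.0 + math.sqrt(2.0)                    # principal eigenvalue, about 2.41
e_R = normalize([math.sqrt(2.0), 1.0, 0.0])   # right-eigenvector
e_L = normalize([1.0, math.sqrt(2.0), 0.0])   # left-eigenvector

# Largest violation of A e_R = lam e_R and of e_L^T A = lam e_L^T.
right_residual = max(abs(x - lam * y) for x, y in zip(mat_vec(A, e_R), e_R))
left_residual = max(abs(x - lam * y) for x, y in zip(vec_mat(e_L, A), e_L))

print(round(lam, 2), [round(x, 2) for x in e_L], [round(x, 2) for x in e_R])
print(right_residual, left_residual)
```

Because this matrix is not symmetric, the left- and right-eigenvectors for the same eigenvalue differ, even though they share the eigenvalue.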
Symmetric matrices

If $\mathbf{A}$ is a real symmetric $N \times N$ matrix then

1. all the eigenvalues and eigenvectors of $\mathbf{A}$ are real;

2. every left-eigenvector of $\mathbf{A}$ is also a right-eigenvector of $\mathbf{A}$ with the same
   eigenvalue, and vice versa;

3. a set of N eigenvectors and eigenvalues $\{\mathbf{e}^{(a)}, \lambda_a\}_{a=1}^{N}$ can be found that
   are orthonormal, that is,

   $$\mathbf{e}^{(a)} \cdot \mathbf{e}^{(b)} = \delta_{ab}; \tag{C.3}$$

   the matrix can be expressed as a weighted sum of outer products of the
   eigenvectors:

   $$\mathbf{A} = \sum_{a=1}^{N} \lambda_a [\mathbf{e}^{(a)}][\mathbf{e}^{(a)}]^{\mathsf{T}}. \tag{C.4}$$

(Whereas I often use i and n as indices for sets of size I and N, I will use the
indices a and b to run over eigenvectors, even if there are N of them. This is
to avoid confusion with the components of the eigenvectors, which are indexed
by n, e.g. $e^{(a)}_n$.)
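Properties (C.3) and (C.4) can be verified by hand for a small case. The sketch below is an illustration under my own assumptions: it uses the symmetric 2 × 2 matrix with rows (0, 1) and (1, 1), whose eigenvalues are $(1 \pm \sqrt{5})/2$ and whose eigenvectors are proportional to $(1, \lambda)$; those closed forms are specific to this matrix, and the function names are hypothetical.

```python
import math

A = [[0.0, 1.0],
     [1.0, 1.0]]

# Eigenvalues of this particular matrix: the roots of lam^2 = lam + 1.
eigenvalues = [(1.0 + math.sqrt(5.0)) / 2.0, (1.0 - math.sqrt(5.0)) / 2.0]


def unit_eigenvector(lam):
    """A (1, lam)^T = (lam, 1 + lam)^T = lam (1, lam)^T because lam^2 = lam + 1."""
    norm = math.hypot(1.0, lam)
    return [1.0 / norm, lam / norm]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


eigenvectors = [unit_eigenvector(lam) for lam in eigenvalues]

# (C.3): the matrix of inner products should be the identity.
gram = [[dot(u, v) for v in eigenvectors] for u in eigenvectors]

# (C.4): the weighted sum of outer products should rebuild A.
reconstructed = [[sum(lam * e[i] * e[j] for lam, e in zip(eigenvalues, eigenvectors))
                  for j in range(2)] for i in range(2)]

print([[round(x, 12) for x in row] for row in gram])
print([[round(x, 12) for x in row] for row in reconstructed])
```

The two eigenvectors are orthogonal because the product of the two eigenvalues is −1, so $1 + \lambda_1 \lambda_2 = 0$.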


General square matrices 


An $N \times N$ matrix can have up to N distinct eigenvalues. Generically, there
are N eigenvalues, all distinct, and each has one left-eigenvector and one
right-eigenvector. In cases where two or more eigenvalues coincide, for each distinct
eigenvalue that is non-zero there is at least one left-eigenvector and one
right-eigenvector.
Left- and right-eigenvectors that have different eigenvalues are orthogonal,
that is,

$$\text{if } \lambda_a \neq \lambda_b \text{ then } \mathbf{e}_L^{(a)} \cdot \mathbf{e}_R^{(b)} = 0. \tag{C.5}$$
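The orthogonality property (C.5) can be illustrated with a non-symmetric 2 × 2 matrix. This is a sketch under stated assumptions: the helper `eigen_2x2` is hypothetical and relies on the closed-form eigenvalues of a 2 × 2 matrix with real, distinct eigenvalues and non-zero off-diagonal entries; it is not a general eigen-solver.

```python
import math


def eigen_2x2(M):
    """Return [(eigenvalue, e_L, e_R), ...] for a 2x2 matrix [[a, b], [c, d]].

    Assumes real, distinct eigenvalues and non-zero b and c (true for the
    example below); e_L is a row vector and e_R a column vector.
    """
    (a, b), (c, d) = M
    trace, det = a + d, a * d - b * c
    root = math.sqrt(trace * trace - 4.0 * det)
    triples = []
    for lam in ((trace + root) / 2.0, (trace - root) / 2.0):
        e_R = (b, lam - a)    # satisfies M e_R = lam e_R
        e_L = (c, lam - a)    # satisfies e_L M = lam e_L
        triples.append((lam, e_L, e_R))
    return triples


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


M = [[1.0, 2.0],
     [1.0, 1.0]]
(lam1, eL1, eR1), (lam2, eL2, eR2) = eigen_2x2(M)

cross_12 = dot(eL1, eR2)   # different eigenvalues: should vanish
cross_21 = dot(eL2, eR1)   # different eigenvalues: should vanish
same_11 = dot(eL1, eR1)    # same eigenvalue: non-zero

print(lam1, lam2)
print(cross_12, cross_21, same_11)
```

The left- and right-eigenvectors belonging to the same eigenvalue are not orthogonal, which is what allows them to be normalized so that their inner product is 1.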


Non-negative matrices 


Definition. If all the elements of a non-zero matrix $\mathbf{C}$ satisfy $C_{mn} \geq 0$ then $\mathbf{C}$
is a non-negative matrix. Similarly, if all the elements of a non-zero vector $\mathbf{c}$
satisfy $c_n \geq 0$ then $\mathbf{c}$ is a non-negative vector.
Properties. A non-negative matrix has a principal eigenvector that is
non-negative. It may also have other eigenvectors with the same eigenvalue that
are not non-negative. But if the principal eigenvalue of a non-negative matrix
is not degenerate, then the matrix has only one principal eigenvector $\mathbf{e}^{(1)}$, and
it is non-negative.

Generically, all the other eigenvalues are smaller in absolute magnitude. 
[There can be several eigenvalues of identical magnitude in special cases.] 


Transition probability matrices 


An important example of a non-negative matrix is a transition probability 
matrix Q. 

Definition. A transition probability matrix $\mathbf{Q}$ has columns that are probability
vectors, that is, it satisfies $\mathbf{Q} \geq 0$ and

$$\sum_i Q_{ij} = 1 \text{ for all } j. \tag{C.6}$$
Table C.4. Some matrices and their eigenvectors.

    Matrix
        [ 1  2  0 ]
        [ 1  1  0 ]
        [ 0  0  1 ]

        Eigenvalue   e_L                 e_R
        2.41         (.58, .82, 0)       (.82, .58, 0)
        1            (0, 0, 1)           (0, 0, 1)
        −0.41        (−.58, .82, 0)      (−.82, .58, 0)

    Matrix
        [ 0  1 ]
        [ 1  1 ]

        Eigenvalue   e_L                 e_R
        1.62         (.53, .85)          (.53, .85)
        −0.62        (.85, −.53)         (.85, −.53)

    Matrix
        [ 1  1  0  0 ]
        [ 0  0  0  1 ]
        [ 1  0  0  0 ]
        [ 0  0  1  1 ]

        Eigenvalue   e_L                                   e_R
        1.62         (.60, .37, .37, .60)                  (.60, .37, .37, .60)
        0.5+0.9i     (.1−.5i, −.3−.4i, .3+.4i, −.1+.5i)    (.1−.5i, .3+.4i, −.3−.4i, −.1+.5i)
        0.5−0.9i     (.1+.5i, −.3+.4i, .3−.4i, −.1−.5i)    (.1+.5i, .3−.4i, −.3+.4i, −.1−.5i)
        −0.62        (.37, −.60, −.60, .37)                (.37, −.60, −.60, .37)


Table C.5. Transition probability matrices for generating random paths through trellises.

    Matrix
        [ 0   .38 ]
        [ 1   .62 ]

        Eigenvalue   e_L                 e_R
        1            (.71, .71)          (.36, .93)
        −0.38        (−.93, .36)         (−.71, .71)

    Matrix
        [ 0   .35   0   ]
        [ 0   0     .46 ]
        [ 1   .65   .54 ]

        Eigenvalue    e_L                            e_R
        1             (.58, .58, .58)                (.14, .41, .90)
        −0.2−0.3i     (−.8+.1i, −.2−.5i, .2+.2i)     (.2−.5i, −.6+.2i, .4+.3i)
        −0.2+0.3i     (−.8−.1i, −.2+.5i, .2−.2i)     (.2+.5i, −.6−.2i, .4−.3i)


This property can be rewritten in terms of the all-ones vector $\mathbf{n} = (1, 1, \ldots, 1)^{\mathsf{T}}$:

$$\mathbf{n}^{\mathsf{T}}\mathbf{Q} = \mathbf{n}^{\mathsf{T}}. \tag{C.7}$$

So $\mathbf{n}$ is the principal left-eigenvector of $\mathbf{Q}$ with eigenvalue $\lambda_1 = 1$.

$$\mathbf{e}_L^{(1)} = \mathbf{n}. \tag{C.8}$$

Because it is a non-negative matrix, $\mathbf{Q}$ has a principal right-eigenvector that
is non-negative, $\mathbf{e}_R^{(1)}$. Generically, for Markov processes that are ergodic, this
eigenvector is the only right-eigenvector with eigenvalue of magnitude 1 (see
table C.6 for illustrative exceptions). This vector, if we normalize it such that
$\mathbf{e}_R^{(1)} \cdot \mathbf{n} = 1$, is called the invariant distribution of the transition probability
matrix. It is the probability density that is left unchanged under $\mathbf{Q}$. Unlike
the principal left-eigenvector, which we explicitly identified above, we can’t
usually identify the principal right-eigenvector without computation.

The matrix may have up to N − 1 other right-eigenvectors all of which are
orthogonal to the left-eigenvector $\mathbf{n}$, that is, they are zero-sum vectors.
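The computation of the invariant distribution can be sketched in a few lines. The example below is an illustration, not the book's method: it uses repeated application of Q to a starting distribution (power iteration), which converges here because the only other eigenvalue has magnitude 0.7. The 2 × 2 matrix is the block that appears in table C.6, and the function names are hypothetical.

```python
def apply(Q, p):
    """One step of the Markov chain: the column vector Q p."""
    return [sum(Q[i][j] * p[j] for j in range(len(p))) for i in range(len(Q))]


def invariant_distribution(Q, iterations=200):
    """Approximate the principal right-eigenvector by power iteration."""
    n = len(Q)
    p = [1.0 / n] * n                 # start from the uniform distribution
    for _ in range(iterations):
        p = apply(Q, p)
    return p


Q = [[0.90, 0.20],
     [0.10, 0.80]]

# (C.6)/(C.7): every column sums to 1, so n^T Q = n^T.
column_sums = [sum(Q[i][j] for i in range(2)) for j in range(2)]

p = invariant_distribution(Q)

# The other right-eigenvector is zero-sum: Q (1, -1)^T = 0.7 (1, -1)^T.
other = apply(Q, [1.0, -1.0])

print(column_sums, p, other)
```

For this matrix the invariant distribution is (2/3, 1/3). Power iteration would not converge for a chain like example (b) of table C.6, where an eigenvalue −1 makes the density oscillate.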


> C.3 Perturbation theory 


Perturbation theory is not used in this book, but it is useful in this book’s 
fields. In this section we derive first-order perturbation theory for the eigen- 
vectors and eigenvalues of square, not necessarily symmetric, matrices. Most 
presentations of perturbation theory focus on symmetric matrices, but non- 
symmetric matrices (such as transition matrices) also deserve to be perturbed! 
Table C.6. Illustrative transition probability matrices and their eigenvectors showing the two ways of
being non-ergodic. (a) More than one principal eigenvector with eigenvalue 1 because the
state space falls into two unconnected pieces. (a’) A small perturbation breaks the
degeneracy of the principal eigenvectors. (b) Under this chain, the density may oscillate between
two parts of the state space. In addition to the invariant distribution, there is another
right-eigenvector with eigenvalue −1. In general such circulating densities correspond to
complex eigenvalues with magnitude 1.

    (a) Matrix
        [ .90  .20  0    0   ]
        [ .10  .80  0    0   ]
        [ 0    0    .90  .20 ]
        [ 0    0    .10  .80 ]

        Eigenvalue   e_L                       e_R
        1            (0, 0, .71, .71)          (0, 0, .89, .45)
        1            (.71, .71, 0, 0)          (.89, .45, 0, 0)
        0.70         (.45, −.89, 0, 0)         (.71, −.71, 0, 0)
        0.70         (0, 0, −.45, .89)         (0, 0, −.71, .71)

    (a’) Matrix
        [ .90  .20  0    0   ]
        [ .10  .79  .02  0   ]
        [ 0    .01  .88  .20 ]
        [ 0    0    .10  .80 ]

        Eigenvalue   e_L                       e_R
        1            (.50, .50, .50, .50)      (.87, .43, .22, .11)
        0.98         (−.18, −.15, .66, .72)    (−.66, −.28, .61, .33)
        0.70         (.20, −.40, −.40, .80)    (.63, −.63, −.32, .32)
        0.69         (−.19, .41, −.44, .77)    (−.61, .65, −.35, .30)

    (b) Matrix
        [ 0    0    .90  .20 ]
        [ 0    0    .10  .80 ]
        [ .90  .20  0    0   ]
        [ .10  .80  0    0   ]

        Eigenvalue   e_L                       e_R
        1            (.50, .50, .50, .50)      (.63, .32, .63, .32)
        0.70         (−.32, .63, −.32, .63)    (.50, −.50, .50, −.50)
        −0.70        (.32, −.63, −.32, .63)    (−.50, .50, .50, −.50)
        −1           (.50, .50, −.50, −.50)    (.63, .32, −.63, −.32)


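We assume that we have an $N \times N$ matrix $\mathbf{H}$ that is a function $\mathbf{H}(\epsilon)$ of
a real parameter $\epsilon$, with $\epsilon = 0$ being our starting point. We assume that a
Taylor expansion of $\mathbf{H}(\epsilon)$ is appropriate:

$$\mathbf{H}(\epsilon) = \mathbf{H}(0) + \epsilon\mathbf{V} + \cdots \tag{C.9}$$

where

$$\mathbf{V} \equiv \frac{\partial \mathbf{H}}{\partial \epsilon}. \tag{C.10}$$

We assume that for all $\epsilon$ of interest, $\mathbf{H}(\epsilon)$ has a complete set of N
right-eigenvectors and left-eigenvectors, and that these eigenvectors and their
eigenvalues are continuous functions of $\epsilon$. This last assumption is not necessarily a
good one: if $\mathbf{H}(0)$ has degenerate eigenvalues then it is possible for the
eigenvectors to be discontinuous in $\epsilon$; in such cases, degenerate perturbation theory
is needed. That’s a fun topic, but let’s stick with the non-degenerate case
here.
We write the eigenvectors and eigenvalues as follows:

$$\mathbf{H}(\epsilon)\mathbf{e}_R^{(a)}(\epsilon) = \lambda^{(a)}(\epsilon)\mathbf{e}_R^{(a)}(\epsilon), \tag{C.11}$$

and we Taylor-expand

$$\lambda^{(a)}(\epsilon) = \lambda^{(a)}(0) + \epsilon\mu^{(a)} + \cdots \tag{C.12}$$

with

$$\mu^{(a)} \equiv \frac{\partial \lambda^{(a)}(\epsilon)}{\partial \epsilon} \tag{C.13}$$

and

$$\mathbf{e}_R^{(a)}(\epsilon) = \mathbf{e}_R^{(a)}(0) + \epsilon\mathbf{f}_R^{(a)} + \cdots \tag{C.14}$$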
with

$$\mathbf{f}_R^{(a)} \equiv \frac{\partial \mathbf{e}_R^{(a)}}{\partial \epsilon}, \tag{C.15}$$

and similar definitions for $\mathbf{e}_L^{(a)}$ and $\mathbf{f}_L^{(a)}$. We define these left-vectors to be row
vectors, so that the ‘transpose’ operation is not needed and can be banished.
We are free to constrain the magnitudes of the eigenvectors in whatever way
we please. Each left-eigenvector and each right-eigenvector has an arbitrary
magnitude. The natural constraints to use are as follows. First, we constrain
the inner products with:

$$\mathbf{e}_L^{(a)}(\epsilon)\mathbf{e}_R^{(a)}(\epsilon) = 1, \text{ for all } a. \tag{C.16}$$

Expanding the eigenvectors in $\epsilon$, equation (C.16) implies

$$(\mathbf{e}_L^{(a)}(0) + \epsilon\mathbf{f}_L^{(a)} + \cdots)(\mathbf{e}_R^{(a)}(0) + \epsilon\mathbf{f}_R^{(a)} + \cdots) = 1, \tag{C.17}$$

from which we can extract the terms in $\epsilon$, which say:

$$\mathbf{e}_L^{(a)}(0)\mathbf{f}_R^{(a)} + \mathbf{f}_L^{(a)}\mathbf{e}_R^{(a)}(0) = 0. \tag{C.18}$$

We are now free to choose the two constraints:

$$\mathbf{e}_L^{(a)}(0)\mathbf{f}_R^{(a)} = 0, \quad \mathbf{f}_L^{(a)}\mathbf{e}_R^{(a)}(0) = 0, \tag{C.19}$$

which in the special case of a symmetric matrix correspond to constraining
the eigenvectors to be of constant length, as defined by the Euclidean norm.

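OK, now that we have defined our cast of characters, what do the defining
equations (C.11) and (C.9) tell us about our Taylor expansions (C.12) and
(C.14)? We expand equation (C.11) in $\epsilon$.

$$(\mathbf{H}(0) + \epsilon\mathbf{V} + \cdots)(\mathbf{e}_R^{(a)}(0) + \epsilon\mathbf{f}_R^{(a)} + \cdots) = (\lambda^{(a)}(0) + \epsilon\mu^{(a)} + \cdots)(\mathbf{e}_R^{(a)}(0) + \epsilon\mathbf{f}_R^{(a)} + \cdots). \tag{C.20}$$

Identifying the terms of order $\epsilon$, we have:

$$\mathbf{H}(0)\mathbf{f}_R^{(a)} + \mathbf{V}\mathbf{e}_R^{(a)}(0) = \lambda^{(a)}(0)\mathbf{f}_R^{(a)} + \mu^{(a)}\mathbf{e}_R^{(a)}(0). \tag{C.21}$$

We can extract interesting results from this equation by hitting it with $\mathbf{e}_L^{(b)}(0)$:

$$\mathbf{e}_L^{(b)}(0)\mathbf{H}(0)\mathbf{f}_R^{(a)} + \mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0) = \lambda^{(a)}(0)\mathbf{e}_L^{(b)}(0)\mathbf{f}_R^{(a)} + \mu^{(a)}\mathbf{e}_L^{(b)}(0)\mathbf{e}_R^{(a)}(0).$$

Using $\mathbf{e}_L^{(b)}(0)\mathbf{H}(0) = \lambda^{(b)}(0)\mathbf{e}_L^{(b)}(0)$ on the left, and the orthogonality (C.5) together with the
normalization (C.16) on the right, this becomes

$$\Rightarrow \lambda^{(b)}(0)\mathbf{e}_L^{(b)}(0)\mathbf{f}_R^{(a)} + \mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0) = \lambda^{(a)}(0)\mathbf{e}_L^{(b)}(0)\mathbf{f}_R^{(a)} + \mu^{(a)}\delta_{ab}. \tag{C.22}$$

Setting b = a we obtain

$$\mathbf{e}_L^{(a)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0) = \mu^{(a)}. \tag{C.23}$$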


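Alternatively, choosing $b \neq a$, we obtain:

$$\mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0) = \left[\lambda^{(a)}(0) - \lambda^{(b)}(0)\right]\mathbf{e}_L^{(b)}(0)\mathbf{f}_R^{(a)} \tag{C.24}$$

$$\Rightarrow \mathbf{e}_L^{(b)}(0)\mathbf{f}_R^{(a)} = \frac{1}{\lambda^{(a)}(0) - \lambda^{(b)}(0)}\,\mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0). \tag{C.25}$$

Now, assuming that the right-eigenvectors $\{\mathbf{e}_R^{(b)}(0)\}_{b=1}^{N}$ form a complete basis,
we must be able to write

$$\mathbf{f}_R^{(a)} = \sum_b w_b \mathbf{e}_R^{(b)}(0), \tag{C.26}$$

where

$$w_b = \mathbf{e}_L^{(b)}(0)\mathbf{f}_R^{(a)}, \tag{C.27}$$

so, comparing (C.25) and (C.27), and noting that $w_a = 0$ by the constraint (C.19), we have:

$$\mathbf{f}_R^{(a)} = \sum_{b \neq a} \frac{\mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0)}{\lambda^{(a)}(0) - \lambda^{(b)}(0)}\,\mathbf{e}_R^{(b)}(0). \tag{C.28}$$

Equations (C.23) and (C.28) are the solution to the first-order perturbation
theory problem, giving respectively the first derivative of the eigenvalue and
the eigenvectors.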
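Result (C.23) is easy to test numerically. The following sketch compares $\mu^{(a)} = \mathbf{e}_L^{(a)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0)$ with a central finite-difference estimate of the eigenvalue derivative. The matrices `H0` and `V` are arbitrary illustrative values of my own, and the helper `eigen_2x2` is hypothetical: it uses the closed-form solution for a non-symmetric 2 × 2 matrix with real, distinct eigenvalues and normalizes the left-eigenvectors according to (C.16).

```python
import math


def eigen_2x2(M):
    """Eigen-triples (eigenvalue, e_L, e_R) of a 2x2 matrix [[a, b], [c, d]].

    Assumes real, distinct eigenvalues and non-zero b and c. Each row vector
    e_L is rescaled so that e_L e_R = 1, the constraint (C.16).
    """
    (a, b), (c, d) = M
    trace, det = a + d, a * d - b * c
    root = math.sqrt(trace * trace - 4.0 * det)
    triples = []
    for lam in ((trace + root) / 2.0, (trace - root) / 2.0):
        e_R = [b, lam - a]
        e_L = [c, lam - a]
        scale = e_L[0] * e_R[0] + e_L[1] * e_R[1]
        triples.append((lam, [x / scale for x in e_L], e_R))
    return triples


def perturbed(H0, V, eps):
    """H(eps) = H(0) + eps V, as in (C.9) truncated at first order."""
    return [[H0[i][j] + eps * V[i][j] for j in range(2)] for i in range(2)]


def sandwich(e_L, M, e_R):
    """The scalar e_L M e_R."""
    return sum(e_L[i] * M[i][j] * e_R[j] for i in range(2) for j in range(2))


H0 = [[2.0, 1.0], [0.5, 1.0]]     # illustrative non-symmetric matrix
V = [[0.3, -0.2], [0.4, 0.1]]     # illustrative perturbation

# (C.23): predicted first derivatives of the two eigenvalues.
predicted = [sandwich(e_L, V, e_R) for _, e_L, e_R in eigen_2x2(H0)]

# Central finite difference of the exact eigenvalues of H(eps).
eps = 1e-6
plus = [lam for lam, _, _ in eigen_2x2(perturbed(H0, V, eps))]
minus = [lam for lam, _, _ in eigen_2x2(perturbed(H0, V, -eps))]
numerical = [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

print(predicted)
print(numerical)
```

The two derivatives sum to the trace of V, as they must, since the eigenvalues of H(ε) sum to its trace.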


Second-order perturbation theory 


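If we expand the eigenvector equation (C.11) to second order in $\epsilon$, and assume
that the equation

$$\mathbf{H}(\epsilon) = \mathbf{H}(0) + \epsilon\mathbf{V} \tag{C.29}$$

is exact, that is, $\mathbf{H}$ is a purely linear function of $\epsilon$, then we have:

$$(\mathbf{H}(0) + \epsilon\mathbf{V})\left(\mathbf{e}_R^{(a)}(0) + \epsilon\mathbf{f}_R^{(a)} + \tfrac{1}{2}\epsilon^2\mathbf{g}_R^{(a)} + \cdots\right) = \left(\lambda^{(a)}(0) + \epsilon\mu^{(a)} + \tfrac{1}{2}\epsilon^2\nu^{(a)} + \cdots\right)\left(\mathbf{e}_R^{(a)}(0) + \epsilon\mathbf{f}_R^{(a)} + \tfrac{1}{2}\epsilon^2\mathbf{g}_R^{(a)} + \cdots\right) \tag{C.30}$$

where $\mathbf{g}_R^{(a)}$ and $\nu^{(a)}$ are the second derivatives of the eigenvector and eigenvalue.
Equating the second-order terms in $\epsilon$ in equation (C.30),

$$\mathbf{V}\mathbf{f}_R^{(a)} + \tfrac{1}{2}\mathbf{H}(0)\mathbf{g}_R^{(a)} = \tfrac{1}{2}\lambda^{(a)}(0)\mathbf{g}_R^{(a)} + \tfrac{1}{2}\nu^{(a)}\mathbf{e}_R^{(a)}(0) + \mu^{(a)}\mathbf{f}_R^{(a)}. \tag{C.31}$$

Hitting this equation on the left with $\mathbf{e}_L^{(a)}(0)$, we obtain:

$$\mathbf{e}_L^{(a)}(0)\mathbf{V}\mathbf{f}_R^{(a)} + \tfrac{1}{2}\lambda^{(a)}(0)\mathbf{e}_L^{(a)}(0)\mathbf{g}_R^{(a)} = \tfrac{1}{2}\lambda^{(a)}(0)\mathbf{e}_L^{(a)}(0)\mathbf{g}_R^{(a)} + \tfrac{1}{2}\nu^{(a)}\mathbf{e}_L^{(a)}(0)\mathbf{e}_R^{(a)}(0) + \mu^{(a)}\mathbf{e}_L^{(a)}(0)\mathbf{f}_R^{(a)}. \tag{C.32}$$

The term $\mathbf{e}_L^{(a)}(0)\mathbf{f}_R^{(a)}$ is equal to zero because of our constraints (C.19), so

$$\mathbf{e}_L^{(a)}(0)\mathbf{V}\mathbf{f}_R^{(a)} = \tfrac{1}{2}\nu^{(a)}, \tag{C.33}$$

so the second derivative of the eigenvalue with respect to $\epsilon$ is given by

$$\tfrac{1}{2}\nu^{(a)} = \mathbf{e}_L^{(a)}(0)\mathbf{V}\sum_{b \neq a} \frac{\mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0)}{\lambda^{(a)}(0) - \lambda^{(b)}(0)}\,\mathbf{e}_R^{(b)}(0) \tag{C.34}$$

$$= \sum_{b \neq a} \frac{[\mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0)][\mathbf{e}_L^{(a)}(0)\mathbf{V}\mathbf{e}_R^{(b)}(0)]}{\lambda^{(a)}(0) - \lambda^{(b)}(0)}. \tag{C.35}$$

This is as far as we will take the perturbation expansion.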


Summary 


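If we introduce the abbreviation $V_{ba}$ for $\mathbf{e}_L^{(b)}(0)\mathbf{V}\mathbf{e}_R^{(a)}(0)$, we can write the
eigenvectors of $\mathbf{H}(\epsilon) = \mathbf{H}(0) + \epsilon\mathbf{V}$ to first order as

$$\mathbf{e}_R^{(a)}(\epsilon) = \mathbf{e}_R^{(a)}(0) + \epsilon\sum_{b \neq a} \frac{V_{ba}}{\lambda^{(a)}(0) - \lambda^{(b)}(0)}\,\mathbf{e}_R^{(b)}(0) + \cdots \tag{C.36}$$

and the eigenvalues to second order as

$$\lambda^{(a)}(\epsilon) = \lambda^{(a)}(0) + \epsilon V_{aa} + \epsilon^2\sum_{b \neq a} \frac{V_{ba}V_{ab}}{\lambda^{(a)}(0) - \lambda^{(b)}(0)} + \cdots \tag{C.37}$$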
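The summary formulas can be compared with an exact calculation for a small matrix. This is a sketch under stated assumptions: `H0`, `V` and the step `eps = 0.01` are illustrative values of my own, the helper `eigen_2x2` is hypothetical and valid only for 2 × 2 matrices with real, distinct eigenvalues, and the exact perturbed eigenvector is rescaled so that $\mathbf{e}_L^{(a)}(0)\mathbf{e}_R^{(a)}(\epsilon) = 1$, which is the normalization implied by the constraint (C.19).

```python
import math


def eigen_2x2(M):
    """Eigen-triples (eigenvalue, e_L, e_R) of a 2x2 matrix [[a, b], [c, d]],
    assuming real, distinct eigenvalues; e_L is scaled so that e_L e_R = 1."""
    (a, b), (c, d) = M
    trace, det = a + d, a * d - b * c
    root = math.sqrt(trace * trace - 4.0 * det)
    triples = []
    for lam in ((trace + root) / 2.0, (trace - root) / 2.0):
        e_R = [b, lam - a]
        e_L = [c, lam - a]
        scale = e_L[0] * e_R[0] + e_L[1] * e_R[1]
        triples.append((lam, [x / scale for x in e_L], e_R))
    return triples


def sandwich(e_L, M, e_R):
    return sum(e_L[i] * M[i][j] * e_R[j] for i in range(2) for j in range(2))


H0 = [[2.0, 1.0], [0.5, 1.0]]
V = [[0.3, -0.2], [0.4, 0.1]]
eps = 0.01

base = eigen_2x2(H0)
H_eps = [[H0[i][j] + eps * V[i][j] for j in range(2)] for i in range(2)]
exact = eigen_2x2(H_eps)


def V_elem(b, a):
    """V_ba = e_L^(b)(0) V e_R^(a)(0)."""
    return sandwich(base[b][1], V, base[a][2])


first_order_errors, second_order_errors, eigenvector_errors = [], [], []
for a in range(2):
    b = 1 - a                                   # the only other eigenvector
    gap = base[a][0] - base[b][0]               # lambda^(a)(0) - lambda^(b)(0)
    lam_first = base[a][0] + eps * V_elem(a, a)
    lam_second = lam_first + eps ** 2 * V_elem(b, a) * V_elem(a, b) / gap  # (C.37)
    first_order_errors.append(abs(exact[a][0] - lam_first))
    second_order_errors.append(abs(exact[a][0] - lam_second))

    # (C.36), compared with the exact eigenvector rescaled so e_L(0) e_R(eps) = 1.
    predicted_vec = [base[a][2][i] + eps * V_elem(b, a) / gap * base[b][2][i]
                     for i in range(2)]
    scale = sum(l * r for l, r in zip(base[a][1], exact[a][2]))
    eigenvector_errors.append(
        max(abs(exact[a][2][i] / scale - predicted_vec[i]) for i in range(2)))

print(first_order_errors, second_order_errors, eigenvector_errors)
```

In this example the second-order eigenvalue errors are of order $\epsilon^3$ and the first-order eigenvector errors are of order $\epsilon^2$, which is the behaviour the expansions predict.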
> C.4 Some numbers

    Power of 2   Decimal       Quantity
    2^8192       10^2466       Number of distinct 1-kilobyte files
    2^1024       10^308        Number of states of a 2D Ising model with 32×32 spins
    2^1000       10^301        Number of binary strings of length 1000
    2^500        3×10^150
    2^469        10^141        Number of binary strings of length 1000 having 100 1s and 900 0s
    2^266        10^80         Number of electrons in universe
    2^200        1.6×10^60
    2^190        10^57         Number of electrons in solar system
    2^171        3×10^51       Number of electrons in the earth
    2^100        10^30
    2^98         3×10^29       Age of universe/picoseconds
    2^58         3×10^17       Age of universe/seconds
    2^50         10^15
    2^40         10^12
                 10^11         Number of neurons in human brain
                 10^11         Number of bits stored on a DVD
                 3×10^10       Number of bits in the wheat genome
                 6×10^9        Number of bits in the human genome
    2^32         6×10^9        Population of earth
    2^30         10^9
                 2.5×10^8      Number of fibres in the corpus callosum
                 2×10^8        Number of bits in C. Elegans (a worm) genome
                 2×10^8        Number of bits in Arabidopsis thaliana (a flowering plant related to broccoli) genome
                 3×10^7        One year/seconds
                 2×10^7        Number of bits in the compressed PostScript file that is this book
                 2×10^7        Number of bits in unix kernel
                 10^7          Number of bits in the E. Coli genome, or in a floppy disk
                 4×10^6        Number of years since human/chimpanzee divergence
    2^20         10^6          1 048 576
                 2×10^5        Number of generations since human/chimpanzee divergence
                 3×10^4        Number of genes in human genome
                 3×10^4        Number of genes in Arabidopsis thaliana genome
                 1.5×10^3      Number of base pairs in a gene
    2^10         10^3          2^10 = 1024; e^7 = 1096
    2^0          10^0          1
    2^−10        10^−3
    2^−20        10^−6
    2^−30        10^−9
    2^−60        10^−18











Lifetime probability of dying from smoking one pack of cigarettes per day. 
Lifetime probability of dying in a motor vehicle accident 


Lifetime probability of developing cancer because of drinking 2 litres per day of 
water containing 12 p.p.b. benzene 


Probability of error in transmission of coding DNA, per nucleotide, per generation 


Probability of undetected error in a hard disk drive, after error correction 
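Several of the larger entries in this section can be reproduced with a few lines of arithmetic. The sketch below is an illustration under my own assumptions: the variable names are hypothetical, and the reading of the entry 2^469 as the entropy estimate $2^{N H_2(f)}$ with N = 1000 and f = 0.1 is an inference from the numbers, not something stated in the table. The exact binomial coefficient is printed alongside for comparison.

```python
import math


def binary_entropy(f: float) -> float:
    """H_2(f) in bits."""
    return -f * math.log2(f) - (1.0 - f) * math.log2(1.0 - f)


# Powers of two expressed as powers of ten.
log10_files = 8192 * math.log10(2)     # distinct 1-kilobyte files (8192 bits)
log10_ising = 1024 * math.log10(2)     # states of a 32 x 32 Ising model

# Binary strings of length 1000 with 100 1s: entropy estimate and exact count.
entropy_estimate_bits = 1000 * binary_entropy(0.1)
exact_bits = math.log2(math.comb(1000, 100))

seconds_per_year = 365.25 * 24 * 3600

print(f"2^8192 is about 10^{log10_files:.0f}")
print(f"2^1024 is about 10^{log10_ising:.0f}")
print(f"N H_2(0.1) = {entropy_estimate_bits:.1f} bits; "
      f"log2 of the exact count = {exact_bits:.1f} bits")
print(f"one year is about {seconds_per_year:.2e} seconds")
```

The exact count is smaller than the entropy estimate by a factor of roughly $\sqrt{2\pi N f(1-f)}$, a few bits out of several hundred, which is the usual correction from Stirling's approximation.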